ESTIMATION IN FUNCTIONAL REGRESSION FOR GENERAL
EXPONENTIAL FAMILIES

By Winston Wei Dou, David Pollard and Harrison H. Zhou

Yale University

This paper studies a class of exponential family models whose canonical
parameters are specified as linear functionals of an unknown
infinite-dimensional slope function. The optimal minimax rates of
convergence for slope function estimation are established. The estimators
that achieve the optimal rates are constructed by constrained maximum
likelihood estimation with parameters whose dimension grows with sample
size. A change-of-measure argument, inspired by Le Cam's theory of
asymptotic equivalence, is used to eliminate the bias caused by the
nonlinearity of exponential family models.

1. Introduction. There has been extensive exploratory and theoretical study
of functional data analysis (FDA) over the past two decades. Two monographs
by Ramsay and Silverman (2002, 2005) provide comprehensive discussions of the
methods and applications.

Among many problems involving functional data, slope estimation in
functional linear regression has received substantial attention in the
literature: for example, by Cardot, Ferraty and Sarda (2003), Li and Hsing
(2007), and Hall and Horowitz (2007). In particular, Hall and Horowitz (2007)
established minimax rates of convergence and proposed rate-optimal estimators
based on spectral truncation (regression on functional principal components).
They showed that the optimal rates depend on the smoothness of the slope
function and the decay rate of the eigenvalues of the covariance kernel.

In this paper, we study optimal rates of convergence for slope estimation in
functional generalized linear models, for which little theory is available.
We introduce several new technical devices to overcome the problems caused by
nonlinearity of the link function. To analyze our estimator, we establish a
sharp approximation for maximum likelihood estimators for exponential
families parametrized by linear functions of m-dimensional parameters, for an
m that grows with sample size (see Lemma 1). We develop a change-of-measure
argument, inspired by ideas from Le Cam's theory of asymptotic equivalence of
models, to eliminate the effect of bias terms caused by the nonlinearity of
the link function (see Sections 3.3 and 3.4).

More precisely, we consider problems where the observed data consist of
independent, identically distributed pairs $(y_i, X_i)$ where each $X_i$ is a
Gaussian process indexed by a compact subinterval of the real line, which
with no loss of generality we take to be $[0, 1]$. We denote the
corresponding norm and inner product in the space $L^2[0, 1]$ by
$\|\cdot\|$ and $\langle\cdot,\cdot\rangle$.

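We assume, for each $i$, that the random variable $y_i$, conditional on the
process $X_i$, follows a distribution $Q_{\lambda_i}$, where
$\{Q_\lambda : \lambda \in \mathbb{R}\}$ is a one-parameter exponential
family. We take the parameter $\lambda_i$ to be a linear functional of $X_i$
of the form

$$\lambda_i = a + \int_0^1 X_i(t)\,\mathbb{B}(t)\,dt \tag{1}$$

for an unknown constant $a$ and an unknown $\mathbb{B} \in L^2[0, 1]$.
We focus on estimation of $\mathbb{B}$ using integrated squared error loss:

$$L(\mathbb{B}, \widehat{\mathbb{B}}_n) = \|\mathbb{B} - \widehat{\mathbb{B}}_n\|^2 = \int_0^1 \bigl(\mathbb{B}(t) - \widehat{\mathbb{B}}_n(t)\bigr)^2\,dt.$$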

Our models are indexed by parameters $f = (K, a, \mu, \mathbb{B})$, where
$\mu$ is the mean and $K$ is the covariance kernel of the Gaussian process.
The universal constant $\alpha$ controls the decay rate of the eigenvalues of
the kernel $K$ and $\beta$ characterizes the `smoothness' of the slope
function $\mathbb{B}$. See Definition 1 (in Section 2) for the precise
specification of the parameter set $\mathcal{F} = \mathcal{F}(R, \alpha, \beta)$.
The two main results are as follows.

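Theorem 1. (Minimax Upper Bound) Under the assumptions stated in Section 2,
there exists an estimating sequence of $\widehat{\mathbb{B}}_n$'s for which:
for each $\epsilon > 0$ there exists a finite constant $C_\epsilon$ such that

$$\sup_{f \in \mathcal{F}} \mathbb{P}_{n,f}\bigl\{\|\mathbb{B} - \widehat{\mathbb{B}}_n\|^2 > C_\epsilon\, n^{(1-2\beta)/(\alpha+2\beta)}\bigr\} < \epsilon$$

for large enough $n$.

Theorem 2. (Minimax Lower Bound) Under the assumptions stated in Section 2,

$$\liminf_{n\to\infty}\; n^{(2\beta-1)/(\alpha+2\beta)} \sup_{f \in \mathcal{F}} \mathbb{P}_{n,f}\|\mathbb{B} - \widehat{\mathbb{B}}_n\|^2 > 0 \quad\text{for every estimator } \{\widehat{\mathbb{B}}_n\}.$$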

Two closely related works in functional data analysis are Cardot and Sarda
(2005) and Müller and Stadtmüller (2005), which provided theory for the
functional generalized linear model, including the rates of convergence for
prediction in the random design case. However, rate optimality was not
studied. In addition, Müller and Stadtmüller (2005) established an upper
bound for rates of convergence assuming the negligibility of the bias due to
the approximation of the infinite-dimensional model by a sequence of
finite-dimensional models, an issue we overcome by using a change-of-measure
argument. In the functional linear regression setting, Cai and Hall (2006)
and Crambes, Kneip and Sarda (2009) derived optimal rates of convergence for
prediction in the fixed and random design cases. See also Cardot, Mas and
Sarda (2007), which derived a CLT for prediction in the fixed and random
design cases, and Cardot and Johannes (2010), which established a minimax
optimal result for prediction at a random design using thresholding
estimators. In a companion study to our paper, Dou (2010, Chapter 5)
considers optimal prediction in functional generalized linear regressions
with an application to the economic problem of predicting occurrence of
recessions from the U.S. Treasury yield curve.

Our minimax upper bound result (Theorem 1) is proved in Section 3. The
minimax lower bound result (Theorem 2) is established in Section 4. The proof
of Theorem 1 depends on an approximation result (Lemma 1) for maximum
likelihood estimators in exponential family models for parameters whose
dimensions change with sample size. As an aid to the reader, we present our
proof of Theorem 1 in two stages. In Section 3.3, we assume that both the
mean $\mu$ and the covariance kernel $K$ are known. This allows us to
emphasize the key ideas in our proofs without the many technical details that
need to be handled when $\mu$ and $K$ are estimated in the natural way. Many
of those details, as summarized in Lemma 5, involve the spectral theory of
compact operators. We proceed in Section 3.4 to the case where $\mu$ and $K$
are estimated. The proofs for the lemmas are collected together in Section 5.
Some of them invoke the perturbation-theoretic results collected in the
supplemental Appendix.

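2. Regularity conditions. Let $\{Q_\lambda : \lambda \in \mathbb{R}\}$ be a
one-parameter exponential family,

$$dQ_\lambda/dQ_0 = f_\lambda(y) := \exp\bigl(\lambda y - \psi(\lambda)\bigr) \quad\text{for all } \lambda \in \mathbb{R}. \tag{2}$$

Necessarily $\psi(0) = 0$. Remember that $e^{\psi(\lambda)} = Q_0 e^{\lambda y}$
and that the distribution $Q_\lambda$ has mean $\dot\psi(\lambda)$ and
variance $\ddot\psi(\lambda)$.

Remark. We may assume that $\ddot\psi(\lambda) > 0$ for every real $\lambda$.
Otherwise we would have
$0 = \ddot\psi(\lambda_0) = \mathrm{var}_{\lambda_0}(y) = Q_0 f_{\lambda_0}(y)\bigl(y - \dot\psi(\lambda_0)\bigr)^2$
for some $\lambda_0$, which would make $y = \dot\psi(\lambda_0)$ for $Q_0$
almost all $y$ and $Q_\lambda = Q_0$ for every $\lambda$.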

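We assume:

($\ddot\psi$) For each $\epsilon > 0$ there exists a finite constant $C_\epsilon$ for which
$\ddot\psi(\lambda) \le C_\epsilon \exp(\epsilon\lambda^2)$ for all $\lambda \in \mathbb{R}$.
Equivalently, $\ddot\psi(\lambda) \le \exp\bigl(o(\lambda^2)\bigr)$ as $|\lambda| \to \infty$.
($\dddot\psi$) There exists an increasing real function $G$ on $\mathbb{R}^+$ such that

$$|\dddot\psi(\lambda + h)| \le \ddot\psi(\lambda)\,G(|h|) \quad\text{for all } \lambda \text{ and } h.$$

Without loss of generality we assume $G(0) \ge 1$.
As shown in Section 5.3, the assumption ($\dddot\psi$) implies that

$$h^2(Q_\lambda, Q_{\lambda+\delta}) \le \delta^2\,\ddot\psi(\lambda)\,(1 + |\delta|)\,G(|\delta|) \quad\text{for all } \lambda, \delta \in \mathbb{R}, \tag{3}$$

which plays a key role in analyzing both upper and lower bounds.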

We assume the observed data are i.i.d. pairs $(y_i, X_i)$ for $i = 1, \dots, n$, where:

(X) Each $\{X_i(t) : 0 \le t \le 1\}$ is distributed like $\{X(t) : 0 \le t \le 1\}$, a
Gaussian process with mean $\mu(t)$ and covariance kernel $K(s, t)$.
(Y) $y_i \mid X_i \sim Q_{\lambda_i}$ with $\lambda_i = a + \langle X_i, \mathbb{B}\rangle$ for an unknown
$\{\mathbb{B}(t) : 0 \le t \le 1\}$ in $L^2[0, 1]$ and $a \in \mathbb{R}$.

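Definition 1. For real constants $\alpha > 1$ and $\beta > (\alpha + 3)/2$ and $R > 0$,
define $\mathcal{F} = \mathcal{F}(R, \alpha, \beta)$ as the set of all
$f = (K, a, \mu, \mathbb{B})$ that satisfy the following conditions.

(K) The covariance kernel is square integrable with respect to Lebesgue measure
and has an eigenfunction expansion (as a compact operator on $L^2[0, 1]$)

$$K(s, t) = \sum_{k \in \mathbb{N}} \theta_k\,\phi_k(s)\,\phi_k(t),$$

where the eigenvalues $\theta_k$ are decreasing with
$R k^{-\alpha} \ge \theta_k \ge \theta_{k+1} + (\alpha/R)\,k^{-\alpha-1}$.
(a) $|a| \le R$.
($\mu$) $\|\mu\| \le R$.
($\mathbb{B}$) $\mathbb{B}$ has an expansion $\mathbb{B}(t) = \sum_{k \in \mathbb{N}} b_k \phi_k(t)$ with
$|b_k| \le R k^{-\beta}$, for the eigenfunctions defined by the kernel $K$.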

Remarks. The awkward lower bound for $\theta_k$ in Assumption (K) implies, for
all $k < j$,

$$\theta_k - \theta_j \ge R^{-1}\int_k^j \alpha x^{-\alpha-1}\,dx = R^{-1}\bigl(k^{-\alpha} - j^{-\alpha}\bigr). \tag{4}$$

If $K$ and $\mu$ were known, we would only need the lower bound
$\theta_k \ge R^{-1}k^{-\alpha}$ and not the lower bound for
$\theta_k - \theta_{k+1}$. As explained by Hall and Horowitz (2007, page 76),
the stronger assumption is needed when one estimates the individual
eigenfunctions of $K$. Note that the subset of $L^2[0, 1]$ in which
$\mathbb{B}$ lies, denoted by $\mathcal{B}_K$, depends on $K$. We regard the
need for the stronger assumption on the eigenvalues and the irksome
Assumption ($\mathbb{B}$) as artifacts of the method of proof, but we have
not yet succeeded in removing either assumption.

More formally, we write $P_{\mu,K}$ for the distribution (a probability
measure on the space $L^2[0, 1]$) of each Gaussian process $X_i$. The joint
distribution of $X_1, \dots, X_n$ is then $\mathbb{P}_{n,\mu,K} = P_{\mu,K}^n$.
We identify the $y_i$'s with the coordinate maps on $\mathbb{R}^n$ equipped
with the product measure
$\mathbb{Q}_{n,a,\mathbb{B},X_1,\dots,X_n} := \otimes_{i \le n} Q_{\lambda_i}$,
which can also be thought of as the conditional joint distribution of
$(y_1, \dots, y_n)$ given $(X_1, \dots, X_n)$. Thus the $\mathbb{P}_{n,f}$ in
Theorems 1 and 2 can be rewritten as an iterated expectation,

$$\mathbb{P}_{n,f} = \mathbb{P}_{n,\mu,K}\,\mathbb{Q}_{n,a,\mathbb{B},X_1,\dots,X_n},$$

the second expectation on the right-hand side averaging out over
$y_1, \dots, y_n$ for given $X_1, \dots, X_n$, the first averaging out over
$X_1, \dots, X_n$. To simplify notation, we will often abbreviate
$\mathbb{Q}_{n,a,\mathbb{B},X_1,\dots,X_n}$ to $\mathbb{Q}_{n,a,\mathbb{B}}$.

3. Proof of Theorem 1. The proof of Theorem 1 is divided into two stages. In
the first stage, we prove the theorem assuming that the covariance kernel $K$
is known. This case is relatively simple and of course artificial, but it
captures the essence of our proof. In the second stage, where $K$ is unknown,
we show that using the natural estimate $\widetilde{K}$ as in (5) does not
affect the result achieved in the first stage. Lemma 5 controls the gap
between the two stages.

In Section 3.1 we introduce the methodology for constructing a sequence of
estimators achieving the optimal rates of convergence. In Section 3.2 we
state the technical lemmas that serve as building blocks for establishing
the main theorems. Their proofs are postponed to Section 5. In Section 3.3 we
prove Theorem 1 assuming $\mu$ and $K$ are known, and then in Section 3.4 we
complete the proof of Theorem 1 with unknown $\mu$ and $K$.

3.1. Methodology. Under the assumptions (X) and (K) from Section 2, the
process $X_i$ admits the eigendecomposition

$$X_i(t) - \mu(t) = Z_i(t) = \sum_{k \in \mathbb{N}} z_{i,k}\,\phi_k(t).$$

The random variables $z_{i,k} := \langle Z_i, \phi_k\rangle$ are independent
with $z_{i,k} \sim N(0, \theta_k)$. Because $\mu$ and $K$ are unknown, we
estimate them in the usual way:
$\widetilde\mu_n(t) = \bar{X}(t) = n^{-1}\sum_{i \le n} X_i(t)$ and

$$\widetilde{K}(s, t) = (n-1)^{-1}\sum_{i \le n}\bigl(X_i(s) - \bar{X}(s)\bigr)\bigl(X_i(t) - \bar{X}(t)\bigr), \tag{5}$$

which has spectral representation

$$\widetilde{K}(s, t) = \sum_{k \in \mathbb{N}} \tilde\theta_k\,\tilde\phi_k(s)\,\tilde\phi_k(t)$$

with $\tilde\theta_1 \ge \tilde\theta_2 \ge \dots \ge \tilde\theta_{n-1} \ge 0$.
In fact we must have $\tilde\theta_k = 0$ for $k \ge n$ because all the
eigenfunctions $\tilde\phi_k$ corresponding to nonzero $\tilde\theta_k$'s
must lie in the $(n-1)$-dimensional space spanned by
$\{Z_i - \bar{Z} : i = 1, 2, \dots, n\}$.

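Using the first $N$ (as defined in (9)) principal components, we can
approximate the original infinite-dimensional model by the following sequence
of truncated finite-dimensional models:

$$y_i \mid X_1, \dots, X_n \sim Q_{\tilde\lambda_{i,N}} \quad\text{with}\quad \tilde\lambda_{i,N} = \tilde b_0 + \sum_{1 \le j \le N} \tilde b_j\,\tilde z_{i,j},$$

where $\tilde b_0 = a + \langle \mathbb{B}, \bar{X}\rangle$, and
$\tilde b_j = \langle \mathbb{B}, \tilde\phi_j\rangle$ for $j \ge 1$, and
$\tilde z_{i,j} = \langle X_i - \bar{X}, \tilde\phi_j\rangle$.

We estimate $\mathbb{B}$ by

$$\widehat{\mathbb{B}}_n = \sum_{1 \le j \le m} \hat b_j\,\tilde\phi_j, \tag{6}$$

where $(\hat b_0, \dots, \hat b_N)$ is the conditional MLE for the truncated
model and $m \le N$. More precisely, $(\hat b_0, \dots, \hat b_N)$ is chosen
to maximize the following conditional (on the $X_i$'s) log likelihood over
$(g_0, g_1, \dots, g_N)$ in $\mathbb{R}^{N+1}$:

$$\mathcal{L}_n(g_0, g_1, \dots, g_N) = \sum_{i \le n}\Bigl[y_i\Bigl(g_0 + \sum_{j \le N} g_j \tilde z_{i,j}\Bigr) - \psi\Bigl(g_0 + \sum_{j \le N} g_j \tilde z_{i,j}\Bigr)\Bigr] \tag{7}$$

with

$$m \asymp n^{1/(\alpha+2\beta)} \tag{8}$$

and

$$N \asymp n^{\zeta} \quad\text{with}\quad (2 + 2\alpha)^{-1} > \zeta > (\alpha + 2\beta - 1)^{-1}. \tag{9}$$

Note that $N$ is much larger than $m$. Such a $\zeta$ exists because the
assumptions $\alpha > 1$ and $\beta > (\alpha + 3)/2$ imply
$\alpha + 2\beta - 1 > 2 + 2\alpha$.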
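To make the truncation choices in (8) and (9) concrete, the following is a minimal sketch that computes an admissible exponent for N, the two truncation levels, and the rate n^{(1-2beta)/(alpha+2beta)} for given n, alpha and beta. The function name `truncation_levels`, the use of constant 1 in both order relations, and the midpoint choice of the exponent are illustrative assumptions, not part of the paper.

```python
import math


def truncation_levels(n, alpha, beta, zeta=None):
    """Illustrative truncation levels m, N and the rate rho_n.

    Assumption: the unspecified constants in (8) and (9) are taken to be 1,
    and zeta defaults to the midpoint of its admissible interval.
    """
    if not (alpha > 1 and beta > (alpha + 3) / 2):
        raise ValueError("need alpha > 1 and beta > (alpha + 3) / 2")
    lo = 1.0 / (alpha + 2 * beta - 1)   # lower end of the interval in (9)
    hi = 1.0 / (2 + 2 * alpha)          # upper end of the interval in (9)
    if zeta is None:
        zeta = 0.5 * (lo + hi)          # arbitrary admissible choice
    if not (lo < zeta < hi):
        raise ValueError("zeta must lie strictly inside the interval in (9)")
    m = max(1, round(n ** (1.0 / (alpha + 2 * beta))))   # order n^{1/(alpha+2beta)}
    N = max(m, round(n ** zeta))                         # order n^zeta
    rho_n = n ** ((1 - 2 * beta) / (alpha + 2 * beta))   # minimax rate
    return {"m": m, "N": N, "zeta": zeta,
            "zeta_interval": (lo, hi), "rho_n": rho_n}


if __name__ == "__main__":
    print(truncation_levels(10**6, alpha=2.0, beta=3.0))
```

With alpha = 2 and beta = 3 the admissible interval for the exponent is (1/7, 1/6), which lies above the exponent 1/8 of m, in line with the remark that N is much larger than m.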

3.2. Technical lemmas. We shall first introduce an approximation result for
maximum likelihood estimators in exponential family models for parameters
whose dimensions change with sample size. This lemma combines ideas from
Portnoy (1988) and from Hjort and Pollard (1993). We write our results in a
notation that makes the applications in Sections 3.3 and 3.4 more
straightforward. The notational cost is that the parameters are indexed by
$\{0, 1, \dots, N\}$. To avoid an excess of parentheses we write $N_+$ for
$N + 1$. In the applications $N$ changes with the sample size $n$ and
$\mathbb{Q}$ is replaced by $\mathbb{Q}_{n,a,\mathbb{B},N}$ or
$\widetilde{\mathbb{Q}}_{n,a,\mathbb{B},N}$. For each square matrix $A$, the
spectral norm is defined by $\|A\|_2 := \sup_{|v| \le 1} |Av|$, where $|v|$
denotes the $\ell^2$ norm of the vector $v$.

Lemma 1. Let $Q_\lambda$ be the one-parameter exponential family distribution
defined as in (2) and satisfying regularity condition ($\dddot\psi$). Suppose
$\xi_1, \dots, \xi_n$ are (nonrandom) vectors in $\mathbb{R}^{N_+}$. Suppose
$\mathbb{Q} = \otimes_{i \le n} Q_{\lambda_i}$ with $\lambda_i = \xi_i'\gamma$
for a fixed $\gamma = (\gamma_0, \gamma_1, \dots, \gamma_N)$ in
$\mathbb{R}^{N_+}$. Under $\mathbb{Q}$, the coordinate maps $y_1, \dots, y_n$
are independent random variables with $y_i \sim Q_{\lambda_i}$.

The log-likelihood for fitting the model is

$$L_n(g) = \sum_{i \le n}\bigl[(\xi_i'g)\,y_i - \psi(\xi_i'g)\bigr] \quad\text{for } g \in \mathbb{R}^{N_+},$$

which is maximized (over $\mathbb{R}^{N_+}$) at the MLE $\hat g$ ($= \hat g_n$).
Suppose $\xi_i = D\eta_i$ for some nonsingular matrix $D$, so that the
information matrix $J_n := \sum_{i \le n}\xi_i\xi_i'\,\ddot\psi(\lambda_i)$
satisfies

$$J_n = nDA_nD' \quad\text{where}\quad A_n := \frac{1}{n}\sum_{i \le n}\eta_i\eta_i'\,\ddot\psi(\lambda_i).$$

If $B_n$ is another nonsingular matrix for which

$$\|A_n - B_n\|_2 \le \bigl(2\|B_n^{-1}\|_2\bigr)^{-1} \tag{10}$$

and if

$$\max_{i \le n}|\eta_i| \le \frac{\epsilon\sqrt{n}/N_+}{G(1)\sqrt{32\,\|B_n^{-1}\|_2}} \quad\text{for some } 0 < \epsilon < 1, \tag{11}$$

then for each set of vectors $\kappa_0, \dots, \kappa_M$ in $\mathbb{R}^{N_+}$
there is a set $\mathcal{Y}_{\kappa,\epsilon}$ with
$\mathbb{Q}\mathcal{Y}_{\kappa,\epsilon}^c < 2\epsilon$ on which

^0<j<M' ii\ — ^o<i<A/ ' 

The following approximation result for random matrices will be invoked in
order to apply Lemma 1 in the proof of Theorem 1.

Lemma 2. Suppose $\{\eta_{i,k} : i, k \ge 1\}$ are i.i.d. standard normal
random variables. Let

$$A_n = n^{-1}\sum_{i \le n}\eta_i\eta_i'\,\ddot\psi(\gamma'D\eta_i), \tag{12}$$

where $\gamma = (\gamma_0, \gamma_1, \dots, \gamma_N)'$,
$\eta_i = (1, \eta_{i,1}, \dots, \eta_{i,N})'$, and
$D = \mathrm{diag}(D_0, D_1, \dots, D_N)$. Denote $B_n = \mathbb{P}A_n$ and
assume $\psi$ satisfies condition ($\ddot\psi$). If
$\sum_{k \ge 1} D_k^2\gamma_k^2 < \infty$ and $N = o(n^{1/2})$, it follows
that $\|B_n^{-1}\|_2 = O_{\mathcal{F}}(1)$ and
$\mathbb{P}\|A_n - B_n\|_2^2 = o_{\mathcal{F}}(1)$.

The following lemma establishes a bound on the Hellinger distance between
members of an exponential family, which is the key to our change-of-measure
argument. We write $h(P, Q)$ for the Hellinger distance.

Lemma 3. Suppose $\{Q_\lambda : \lambda \in \mathbb{R}\}$ is an exponential
family defined as in (2) and satisfies regularity condition ($\dddot\psi$).
Then

$$h^2(Q_\lambda, Q_{\lambda+\delta}) \le \delta^2\,\ddot\psi(\lambda)\,(1 + |\delta|)\,G(|\delta|) \quad\text{for all } \lambda, \delta \in \mathbb{R}.$$
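As a numerical illustration of the Hellinger bound just stated, the sketch below checks it on a grid for one concrete family that is not singled out in the paper: the Bernoulli family with logit parameter, for which psi(lambda) = log(1 + e^lambda) - log 2. The choice G(h) = e^h is an assumption made here; it is admissible for this family because the third derivative of psi is bounded in absolute value by the second, and the second derivative changes by at most a factor e^{|h|} under a shift by h. The function names are hypothetical.

```python
import math


def sigmoid(lam):
    return 1.0 / (1.0 + math.exp(-lam))


def psi_ddot(lam):
    """Second derivative of psi(lam) = log(1 + e^lam) - log 2, i.e. Var(y)."""
    p = sigmoid(lam)
    return p * (1.0 - p)


def hellinger_sq(lam1, lam2):
    """Squared Hellinger distance between Bernoulli laws with logits lam1, lam2,
    using the convention h^2 = sum over outcomes of (sqrt(f1) - sqrt(f2))^2."""
    p, q = sigmoid(lam1), sigmoid(lam2)
    return ((math.sqrt(p) - math.sqrt(q)) ** 2
            + (math.sqrt(1.0 - p) - math.sqrt(1.0 - q)) ** 2)


def lemma3_bound(lam, delta, G=math.exp):
    """Right-hand side delta^2 * psi''(lam) * (1 + |delta|) * G(|delta|).
    G = exp is an illustrative choice valid for the Bernoulli family."""
    return delta ** 2 * psi_ddot(lam) * (1.0 + abs(delta)) * G(abs(delta))


def worst_ratio(lams, deltas):
    """Largest ratio h^2 / bound over the grid; a value <= 1 means the bound held."""
    worst = 0.0
    for lam in lams:
        for delta in deltas:
            if abs(delta) < 1e-9:       # both sides vanish at delta = 0
                continue
            ratio = hellinger_sq(lam, lam + delta) / lemma3_bound(lam, delta)
            worst = max(worst, ratio)
    return worst


if __name__ == "__main__":
    lams = [-6.0 + 0.5 * i for i in range(25)]
    deltas = [-3.0 + 0.25 * j for j in range(25)]
    print(worst_ratio(lams, deltas))
```

A ratio well below 1 on the grid reflects the remark in Section 5.3 that the stated bound simplifies some unimportant constants.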

The following lemma provides a maximal inequality for weighted-chi-square 
variables, which easily leads to maximal inequalities for Gaussian processes and 
multivariate normal vectors. These inequalities will be repeatedly invoked. 

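Lemma 4. Suppose $W_i = \sum_{k \in \mathbb{N}} \tau_{i,k}\,\eta_{i,k}^2$ for
$i = 1, \dots, n$, where the $\eta_{i,k}$'s are independent standard normals
and the $\tau_{i,k}$'s are nonnegative constants with

$$\infty > T := \max_{i \le n}\sum_{k \in \mathbb{N}} \tau_{i,k}.$$

Then

$$\mathbb{P}\{\max_{i \le n} W_i > 4T(\log n + x)\} < 2e^{-x} \quad\text{for each } x \ge 0.$$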
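The maximal inequality for weighted chi-square variables can be sanity-checked by simulation. The sketch below is only an illustration under assumptions chosen here (a finite weight sequence shared by all i, a fixed seed, a modest number of replications); it estimates the probability that the maximum exceeds 4T(log n + x) and compares it with 2e^{-x}. The function name `exceedance_frequency` is hypothetical.

```python
import math
import random


def exceedance_frequency(n, taus, x, reps, seed=0):
    """Monte Carlo frequency of {max_i W_i > 4 T (log n + x)}.

    W_i = sum_k taus[k] * eta_{i,k}^2 with i.i.d. standard normal eta's.
    Assumption: the same finite weight vector `taus` is used for every i,
    so T = sum(taus).
    """
    rng = random.Random(seed)
    T = sum(taus)
    threshold = 4.0 * T * (math.log(n) + x)
    hits = 0
    for _rep in range(reps):
        max_w = max(
            sum(t * rng.gauss(0.0, 1.0) ** 2 for t in taus)
            for _i in range(n)
        )
        if max_w > threshold:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    taus = [k ** -2.0 for k in range(1, 21)]   # weights decaying like k^{-2}
    freq = exceedance_frequency(n=20, taus=taus, x=1.0, reps=300, seed=1)
    print(freq, "<=", 2.0 * math.exp(-1.0))
```

The bound is far from tight in such examples, which is harmless for the proofs: only the 2/n-type tail at x = log n is used.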

When we want to indicate that a bound involving constants $c, C, C_1, \dots$
holds uniformly over all models indexed by a set of parameters $\mathcal{F}$,
we write $c(\mathcal{F}), C(\mathcal{F}), C_1(\mathcal{F}), \dots$. By the
usual convention for eliminating subscripts, the values of the constants
might change from one paragraph to the next: a constant $C_1(\mathcal{F})$ in
one place need not be the same as a constant $C_1(\mathcal{F})$ in another
place. For sequences of constants $c_n$ that might depend on $\mathcal{F}$,
we write $c_n = O_{\mathcal{F}}(1)$ and $o_{\mathcal{F}}(1)$ and so on to
show that the asymptotic bounds hold uniformly over $\mathcal{F}$.

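Lemma 5. Let $X_1, \dots, X_n$ be i.i.d. Gaussian processes satisfying (X) and
(K). Let $m$ and $N$ be integers defined as in (8) and (9) respectively.
Suppose $H_p$ and $\widetilde{H}_p$ are the orthogonal projection operators
associated with $\mathrm{span}\{\phi_1, \dots, \phi_p\}$ and
$\mathrm{span}\{\tilde\phi_1, \dots, \tilde\phi_p\}$. Define the matrix
$S := \mathrm{diag}(\sigma_0, \dots, \sigma_N)$ with $\sigma_0 = 1$ and
$\sigma_k = \mathrm{sign}(\langle\phi_k, \tilde\phi_k\rangle)$ for $k \ge 1$.
The key quantities are:

(i) $\Delta := \widetilde{K} - K$
(ii) $\widetilde{D} = \mathrm{diag}(1, \tilde\theta_1, \dots, \tilde\theta_N)^{1/2}$
(iii) $\tilde z_i = (\tilde z_{i,1}, \dots, \tilde z_{i,N})'$ where $\tilde z_{i,k} = \langle Z_i, \tilde\phi_k\rangle$
(iv) $\tilde z_{\cdot} = (\tilde z_{\cdot 1}, \dots, \tilde z_{\cdot N})'$ where $\tilde z_{\cdot k} = \langle \bar{Z}, \tilde\phi_k\rangle = n^{-1}\sum_{i \le n}\tilde z_{i,k}$
(v) $\tilde\xi_i' = (1, (\tilde z_i - \tilde z_{\cdot})')$ and $\tilde\eta_i = D^{-1}\tilde\xi_i$. [We could define
$\tilde\eta_i = \widetilde{D}^{-1}\tilde\xi_i$ but then we would need to show that
$\widetilde{D}^{-1}\tilde\xi_i \approx D^{-1}\tilde\xi_i$. Our definition merely rearranges
the approximation steps.]
(vi) $\tilde\gamma := (\tilde\gamma_0, \tilde b_1, \dots, \tilde b_N)'$ where $\mathbb{B} = \sum_{k \in \mathbb{N}}\tilde b_k\tilde\phi_k$ and
$\tilde\gamma_0 := a + \langle\mathbb{B}, \bar{X}\rangle$. [Note that $\lambda_i = \tilde\gamma_0 + \langle\mathbb{B}, Z_i - \bar{Z}\rangle$.]
(vii) $\tilde\lambda_{i,N} = \tilde\gamma_0 + \langle\widetilde{H}_N\mathbb{B}, Z_i - \bar{Z}\rangle = \tilde\xi_i'\tilde\gamma$
(viii) $\widetilde{A}_n = n^{-1}\sum_{i \le n}\tilde\eta_i\tilde\eta_i'\,\ddot\psi(\tilde\lambda_{i,N})$

For each $\epsilon > 0$ there exists a set $\mathcal{X}_{\epsilon,n}$,
depending on $\mu$ and $K$, with

$$\sup_{\mathcal{F}}\mathbb{P}_{n,\mu,K}\,\mathcal{X}_{\epsilon,n}^c < \epsilon \quad\text{for all large enough } n$$

and on which, for some constant $C_\epsilon$ that does not depend on $\mu$ or $K$,

(i) $\|\Delta\| \le C_\epsilon n^{-1/2}$
(ii) $\max_{i \le n}\|Z_i\|^2 \le C_\epsilon\log n$ and $\|\bar{Z}\| \le C_\epsilon n^{-1/2}$
(iii) $\|(\widetilde{H}_m - H_m)\mathbb{B}\|^2 = o_{\mathcal{F}}(\rho_n)$
(iv) $\|(\widetilde{H}_N - H_N)\mathbb{B}\|^2 = O_{\mathcal{F}}(n^{-1-\nu})$ for some $\nu > 0$ that depends only on $\alpha$ and $\beta$
(v) $\max_{i \le n}|\tilde\eta_i|^2 = o_{\mathcal{F}}(\sqrt{n}/N)$
(vi) $\|S\widetilde{A}_nS - A_n\|_2 = o_{\mathcal{F}}(1)$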

3.3. Proof of Theorem 1 with known Gaussian distribution. Initially we
suppose that $\mu$ and $K$ are known. Under
$\mathbb{Q}_n = \mathbb{Q}_{n,a,\mathbb{B}}$, the $y_i$'s are independent,
with $y_i \sim Q_{\lambda_i}$ and

$$\lambda_i = a + \langle X_i, \mathbb{B}\rangle = b_0 + \sum_{k \in \mathbb{N}} z_{i,k}\,b_k \quad\text{where } b_0 = a + \langle\mu, \mathbb{B}\rangle.$$

Our task is to estimate the $b_k$'s with sufficient accuracy to be able to
estimate $\mathbb{B}(t) = \sum_{k \in \mathbb{N}} b_k\phi_k(t)$ within an
error of order $\rho_n = n^{(1-2\beta)/(\alpha+2\beta)}$. In fact it will
suffice to estimate the component $H_m\mathbb{B}$ of $\mathbb{B}$ in the
subspace spanned by $\{\phi_1, \dots, \phi_m\}$ with
$m \asymp n^{1/(\alpha+2\beta)}$ because

$$\|H_m^\perp\mathbb{B}\|^2 = \sum_{k > m} b_k^2 = O_{\mathcal{F}}(m^{1-2\beta}) = O_{\mathcal{F}}(\rho_n). \tag{13}$$

We might try to estimate the coefficients $(b_0, \dots, b_m)$ by choosing
$\hat g = (\hat g_0, \dots, \hat g_m)$ to maximize a conditional log
likelihood over all $g$ in $\mathbb{R}^{m+1}$,

$$\sum_{i \le n}\bigl[y_i\lambda_{i,m} - \psi(\lambda_{i,m})\bigr] \quad\text{with}\quad \lambda_{i,m} = g_0 + \sum_{1 \le k \le m} z_{i,k}\,g_k.$$

To this end we might try to appeal to Lemma 1, stated in Section 3.2, with
$\kappa_j$ equal to the unit vector with a 1 in its $j$th position for
$j \le m$ and $\kappa_j = 0$ otherwise. That would give a bound for
$\sum_{j \le m}(\hat g_j - \gamma_j)^2$. Unfortunately, we cannot directly
invoke the Lemma with $N = m$ to estimate
$\gamma = (b_0, b_1, \dots, b_N)$ when

$$\mathbb{Q} = \mathbb{Q}_{n,a,\mathbb{B}} \quad\text{and}\quad D = \mathrm{diag}(1, \theta_1, \dots, \theta_N)^{1/2},$$
$$\xi_i' = (1, z_{i,1}, \dots, z_{i,N}) \quad\text{and}\quad \eta_i' = (1, \eta_{i,1}, \dots, \eta_{i,N}), \tag{14}$$

because $\lambda_i \ne \xi_i'\gamma$, a bias problem. Note that in this case
$\eta_{i,j} = z_{i,j}/\sqrt{\theta_j}$ for all $i, j$ and hence the
$\eta_{i,j}$'s are i.i.d. standard normal variables.

Remark. We could modify Lemma 1 to allow
$\lambda_i = \xi_i'\gamma + \mathrm{bias}_i$, for a suitably small bias term,
but at the cost of extra regularity conditions and a more delicate argument.
The same difficulty arises whenever one investigates the asymptotics of
maximum likelihood with the true distribution outside the model family.

Instead, we use a two-stage estimation procedure that eliminates the bias
term by a change of measure conditional on the $X_i$'s. We present the proof
in the following three steps.

Step 1. From the analysis above, one can see that the key to our proof is
the change-of-measure argument and the application of Lemma 1. In this step,
we construct a high probability set such that, for each realization of the
$X_i$'s on the set, the assumptions of Lemma 1 are satisfied and the
change-of-measure argument is ready to work.

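Define $\xi_i$, $D$, and $\eta_i$ as in equation (14). Then we define the
matrix $A_n$ as in (12) and choose $B_n := \mathbb{P}_{n,\mu,K}A_n$. Define
$\mathcal{X}_n = \mathcal{X}_{Z,n} \cap \mathcal{X}_{\eta,n} \cap \mathcal{X}_{A,n}$, where

$$\mathcal{X}_{Z,n} := \{\max_{i \le n}\|Z_i\|^2 \le C_0\log n\} \tag{15}$$
$$\mathcal{X}_{\eta,n} := \{\max_{i \le n}|\eta_i|^2 \le C_0N\log n\} \tag{16}$$
$$\mathcal{X}_{A,n} := \{\|A_n - B_n\|_2 \le (2\|B_n^{-1}\|_2)^{-1}\} \tag{17}$$

If we choose a large enough universal constant $C_0 = C_0(\mathcal{F})$,
Lemma 4 ensures that $\mathbb{P}_{n,\mu,K}\mathcal{X}_{Z,n}^c \le 2/n$ and
$\mathbb{P}_{n,\mu,K}\mathcal{X}_{\eta,n}^c \le 2/n$ by choosing
$\tau_{i,k} = \theta_k$ and $\tau_{i,k} = 1\{k \le N\}$ respectively for all
$i, k$; and Lemma 2 shows that

$$\|B_n^{-1}\|_2 = O_{\mathcal{F}}(1) \quad\text{and}\quad \mathbb{P}_{n,\mu,K}\|A_n - B_n\|_2^2 = o_{\mathcal{F}}(1),$$

thus $\mathbb{P}_{n,\mu,K}\mathcal{X}_{A,n}^c = o_{\mathcal{F}}(1)$. Hence

$$\mathbb{P}_{n,\mu,K}\mathcal{X}_n^c \le \mathbb{P}_{n,\mu,K}\mathcal{X}_{Z,n}^c + \mathbb{P}_{n,\mu,K}\mathcal{X}_{\eta,n}^c + \mathbb{P}_{n,\mu,K}\mathcal{X}_{A,n}^c = o_{\mathcal{F}}(1). \tag{18}$$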

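Step 2. Let us consider the approximate distribution

$$\mathbb{Q}_{n,a,\mathbb{B},N} := \otimes_{i \le n}Q_{\lambda_{i,N}} \quad\text{with}\quad \lambda_{i,N} := \xi_i'\gamma \text{ and } \gamma = (b_0, b_1, \dots, b_N).$$

In this step, we show that the divergence caused by replacing
$\mathbb{Q}_{n,a,\mathbb{B}}$ by $\mathbb{Q}_{n,a,\mathbb{B},N}$ is small
enough that it will not compromise the asymptotic results. In replacing
$\mathbb{Q}_{n,a,\mathbb{B}}$ by $\mathbb{Q}_{n,a,\mathbb{B},N}$ we eliminate
the bias problem, but now we have to relate the probability bounds for
$\mathbb{Q}_{n,a,\mathbb{B},N}$ to bounds involving
$\mathbb{Q}_{n,a,\mathbb{B}}$. A common control of this divergence is the
total variation distance between $\mathbb{Q}_{n,a,\mathbb{B},N}$ and
$\mathbb{Q}_{n,a,\mathbb{B}}$. We shall show that there exists a sequence of
nonnegative constants $c_n$ of order $o_{\mathcal{F}}(\log n)$, such that

$$\|\mathbb{Q}_{n,a,\mathbb{B}} - \mathbb{Q}_{n,a,\mathbb{B},N}\|_{TV}^2 \le e^{2c_n}\sum_{i \le n}|\lambda_i - \lambda_{i,N}|^2 \quad\text{on } \mathcal{X}_n. \tag{19}$$

To establish inequality (19) we use the bound

$$\|\mathbb{Q}_{n,a,\mathbb{B}} - \mathbb{Q}_{n,a,\mathbb{B},N}\|_{TV}^2 \le h^2(\mathbb{Q}_{n,a,\mathbb{B}}, \mathbb{Q}_{n,a,\mathbb{B},N}) \le \sum_{i \le n}h^2(Q_{\lambda_i}, Q_{\lambda_{i,N}}).$$

By Lemma 3,

$$h^2(Q_{\lambda_i}, Q_{\lambda_{i,N}}) \le \delta_i^2\,\ddot\psi(\lambda_i)\,(1 + |\delta_i|)\,G(|\delta_i|)$$

where

$$\delta_i = |\lambda_i - \lambda_{i,N}| = |\langle Z_i, \mathbb{B}\rangle - \langle H_NZ_i, \mathbb{B}\rangle| = |\langle Z_i, H_N^\perp\mathbb{B}\rangle| \le \|Z_i\|\,\|H_N^\perp\mathbb{B}\|.$$

On $\mathcal{X}_n$ the right-hand side is $o_{\mathcal{F}}(1)$, because
$\|Z_i\|^2 \le C_0\log n$ and, as in (13),
$\|H_N^\perp\mathbb{B}\|^2 = O_{\mathcal{F}}(N^{1-2\beta})$. Thus all the
$(1 + |\delta_i|)G(|\delta_i|)$ factors can be bounded by a single
$O_{\mathcal{F}}(1)$ term. For
$(K, a, \mu, \mathbb{B}) \in \mathcal{F}(R, \alpha, \beta)$ and with the
$\|Z_i\|$'s controlled by $\mathcal{X}_n$,

$$|\lambda_i| \le |a| + (\|\mu\| + \|Z_i\|)\,\|\mathbb{B}\| \le C_2\sqrt{\log n}$$

for some constant $C_2 = C_2(\mathcal{F})$. Assumption ($\ddot\psi$) then
ensures that all the $\ddot\psi(\lambda_i)$ are bounded by a single
$\exp\bigl(o_{\mathcal{F}}(\log n)\bigr)$ term.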

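Step 3. On the set $\mathcal{X}_n$, we can apply Lemma 1 directly with
$\mathbb{Q} = \mathbb{Q}_{n,a,\mathbb{B},N}$, because inequality (10) holds
by construction and inequality (11) holds for large enough $n$ because

$$\max_{i \le n}|\eta_i|^2 \le O_{\mathcal{F}}(N\log n) = o_{\mathcal{F}}(\sqrt{n}/N).$$

Estimate $\gamma$ by the $\hat g = (\hat g_0, \dots, \hat g_N)$ defined in
Lemma 1. Thus, the estimator in Theorem 1 is
$\widehat{\mathbb{B}}_n = \sum_{1 \le k \le m}\hat g_k\phi_k$. For each
realization of the $X_i$'s in $\mathcal{X}_n$, Lemma 1 (with $\kappa_k$ the
$k$th unit vector, so that $|D^{-1}\kappa_k|^2 = \theta_k^{-1}$) gives a set
$\mathcal{Y}_{m,\epsilon}$ with
$\mathbb{Q}_{n,a,\mathbb{B},N}\mathcal{Y}_{m,\epsilon}^c < 2\epsilon$ on which

$$\sum_{1 \le k \le m}|\hat g_k - \gamma_k|^2 = O_{\mathcal{F}}\Bigl(n^{-1}\sum_{1 \le k \le m}\theta_k^{-1}\Bigr) = O_{\mathcal{F}}(m^{1+\alpha}/n) = O_{\mathcal{F}}(\rho_n),$$

which implies

$$\|\widehat{\mathbb{B}}_n - \mathbb{B}\|^2 = \sum_{1 \le k \le m}|\hat g_k - b_k|^2 + \sum_{k > m}b_k^2 = O_{\mathcal{F}}(\rho_n).$$

From the inequality (19) it follows, for a large enough constant
$C_\epsilon$, that

$$\mathbb{P}_{n,\mu,K}\mathbb{Q}_{n,a,\mathbb{B}}\{\|\widehat{\mathbb{B}}_n - \mathbb{B}\|^2 > C_\epsilon\rho_n\}$$
$$\le \mathbb{P}_{n,\mu,K}\mathcal{X}_n^c + \mathbb{P}_{n,\mu,K}\mathcal{X}_n\bigl(\|\mathbb{Q}_{n,a,\mathbb{B}} - \mathbb{Q}_{n,a,\mathbb{B},N}\|_{TV} + \mathbb{Q}_{n,a,\mathbb{B},N}\mathcal{Y}_{m,\epsilon}^c\bigr)$$
$$\le o_{\mathcal{F}}(1) + 2\epsilon + \Bigl(e^{2c_n}\sum_{i \le n}\mathbb{P}_{n,\mu,K}|\lambda_i - \lambda_{i,N}|^2\Bigr)^{1/2}.$$

By construction,

$$\lambda_i - \lambda_{i,N} = \sum_{k > N}z_{i,k}\,b_k$$

with the $z_{i,k}$'s independent and $z_{i,k} \sim N(0, \theta_k)$. Thus

$$\sum_{i \le n}\mathbb{P}_{n,\mu,K}|\lambda_i - \lambda_{i,N}|^2 \le n\sum_{k > N}\theta_kb_k^2 = O_{\mathcal{F}}(nN^{1-\alpha-2\beta}) = o_{\mathcal{F}}(e^{-2c_n})$$

because $\zeta > (\alpha + 2\beta - 1)^{-1}$. That is, we have an estimator
that achieves the $O_{\mathcal{F}}(\rho_n)$ minimax rate.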

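3.4. Proof of Theorem 1 with unknown Gaussian distribution. As before, most
of the analysis will be conditional on the $X_i$'s lying in a set with high
probability on which the various estimators and other random quantities are
well behaved. Remember that $\mathcal{X}_{\epsilon,n}$ is the high
probability set defined in Lemma 5. We keep the notation of Lemma 5 for the
key quantities defined there, to make the application more straightforward.

As before, the component of $\mathbb{B}$ orthogonal to
$\mathrm{span}\{\tilde\phi_1, \dots, \tilde\phi_m\}$ causes no trouble
because, for $\widehat{\mathbb{B}}_n = \sum_{1 \le k \le m}\hat g_k\tilde\phi_k$,

$$\|\widehat{\mathbb{B}}_n - \mathbb{B}\|^2 = \sum_{1 \le k \le m}(\hat g_k - \tilde\gamma_k)^2 + \|\widetilde{H}_m^\perp\mathbb{B}\|^2$$

and, by Lemma 5 part (iii),

$$\|\widetilde{H}_m^\perp\mathbb{B}\|^2 \le 2\|H_m^\perp\mathbb{B}\|^2 + 2\|(\widetilde{H}_m - H_m)\mathbb{B}\|^2 = O_{\mathcal{F}}(\rho_n) \quad\text{on } \mathcal{X}_{\epsilon,n}.$$

To handle $\sum_{1 \le k \le m}(\hat g_k - \tilde\gamma_k)^2$, invoke Lemma 1
for $X_i$'s in $\mathcal{X}_{\epsilon,n}$, with $\eta_i$ replaced by
$\tilde\eta_i$ and $A_n$ replaced by $\widetilde{A}_n$ and $B_n$ replaced by
$\widetilde{B}_n = SB_nS$, the same $B_n$ and $D$ as before, and
$\mathbb{Q}$ equal to

$$\widetilde{\mathbb{Q}}_{n,a,\mathbb{B},N} = \otimes_{i \le n}Q_{\tilde\lambda_{i,N}},$$

to get a set $\widetilde{\mathcal{Y}}_{m,\epsilon}$ with
$\widetilde{\mathbb{Q}}_{n,a,\mathbb{B},N}\widetilde{\mathcal{Y}}_{m,\epsilon}^c < 2\epsilon$
on which $\sum_{1 \le k \le m}(\hat g_k - \tilde\gamma_k)^2 = O_{\mathcal{F}}(\rho_n)$.
The conditions of Lemma 1 are satisfied on $\mathcal{X}_{\epsilon,n}$,
because of Lemma 5 part (v) and

$$\|\widetilde{A}_n - \widetilde{B}_n\|_2 \le \|\widetilde{A}_n - SA_nS\|_2 + \|SA_nS - SB_nS\|_2 = o_{\mathcal{F}}(1).$$

To complete the proof it suffices to show that
$\|\widetilde{\mathbb{Q}}_{n,a,\mathbb{B},N} - \mathbb{Q}_{n,a,\mathbb{B},N}\|_{TV}$
tends to zero. First note that

$$\tilde\lambda_{i,N} - \lambda_{i,N} = a + \langle\mathbb{B}, \bar{X}\rangle + \langle\widetilde{H}_N\mathbb{B}, Z_i - \bar{Z}\rangle - a - \langle\mathbb{B}, \mu\rangle - \langle H_N\mathbb{B}, Z_i\rangle$$
$$= \langle H_N^\perp\mathbb{B}, \bar{Z}\rangle + \langle(\widetilde{H}_N - H_N)\mathbb{B}, Z_i - \bar{Z}\rangle,$$

which implies that, on $\mathcal{X}_{\epsilon,n}$,

$$|\tilde\lambda_{i,N} - \lambda_{i,N}|^2 \le 2|\langle H_N^\perp\mathbb{B}, \bar{Z}\rangle|^2 + 2\|\widetilde{H}_N\mathbb{B} - H_N\mathbb{B}\|^2\,(\|Z_i\| + \|\bar{Z}\|)^2$$
$$= O_{\mathcal{F}}(n^{-1-\nu'}) \quad\text{for some } 0 < \nu' < \nu. \tag{20}$$

Now argue as in Step 2 of the proof for the case of known $K$: on
$\mathcal{X}_{\epsilon,n}$,

$$\|\widetilde{\mathbb{Q}}_{n,a,\mathbb{B},N} - \mathbb{Q}_{n,a,\mathbb{B},N}\|_{TV}^2 \le \sum_{i \le n}h^2\bigl(Q_{\tilde\lambda_{i,N}}, Q_{\lambda_{i,N}}\bigr)$$
$$\le \exp\bigl(o_{\mathcal{F}}(\log n)\bigr)\sum_{i \le n}|\tilde\lambda_{i,N} - \lambda_{i,N}|^2 = o_{\mathcal{F}}(1).$$

Finish the argument as before, by splitting into contributions from
$\mathcal{X}_{\epsilon,n}^c$ and
$\mathcal{X}_{\epsilon,n} \cap \widetilde{\mathcal{Y}}_{m,\epsilon}^c$ and
$\mathcal{X}_{\epsilon,n} \cap \widetilde{\mathcal{Y}}_{m,\epsilon}$.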

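4. Proof of Theorem 2. We apply a slight variation on Assouad's Lemma,
combining ideas from Yu (1997) and from van der Vaart (1998, Section 24.3),
to establish the minimax lower bound result in Theorem 2.

We consider behavior only for $\mu = 0$ and $a = 0$, for a fixed $K$ with
spectral decomposition $\sum_{j \in \mathbb{N}}\theta_j\,\phi_j \otimes \phi_j$.
For simplicity we abbreviate $\mathbb{P}_{n,0,K}$ to $\mathbb{P}$. Let
$J = \{m + 1, m + 2, \dots, 2m\}$ and $\Gamma = \{0, 1\}^J$. Let
$\beta_j = Rj^{-\beta}$. For each $\gamma$ in $\Gamma$ define
$\mathbb{B}_\gamma = \epsilon\sum_{j \in J}\gamma_j\beta_j\phi_j$, for a
small $\epsilon > 0$ to be specified, and write $\mathbb{Q}_\gamma$ for the
product measure $\otimes_{i \le n}Q_{\lambda_i(\gamma)}$ with

$$\lambda_i(\gamma) = \langle\mathbb{B}_\gamma, Z_i\rangle = \epsilon\sum_{j \in J}\gamma_j\beta_j z_{i,j}.$$

For each $j$ let $\Gamma_j = \{\gamma \in \Gamma : \gamma_j = 1\}$ and let
$\psi_j$ be the bijection on $\Gamma$ that flips the $j$th coordinate but
leaves all other coordinates unchanged. Let $\pi$ be the uniform
distribution on $\Gamma$, that is, $\pi_\gamma = 2^{-m}$ for each $\gamma$.

For each estimator $\widehat{\mathbb{B}} = \sum_{j \in \mathbb{N}}\hat b_j\phi_j$
we have
$\|\mathbb{B}_\gamma - \widehat{\mathbb{B}}\|^2 \ge \sum_{j \in J}(\epsilon\gamma_j\beta_j - \hat b_j)^2$
and so

$$\sup_{f}\mathbb{P}_{n,f}\|\mathbb{B} - \widehat{\mathbb{B}}\|^2 \ge \sum_{\gamma \in \Gamma}\pi_\gamma\sum_{j \in J}\mathbb{P}\mathbb{Q}_\gamma(\epsilon\gamma_j\beta_j - \hat b_j)^2$$
$$\ge 2^{-m}\sum_{j \in J}\sum_{\gamma \in \Gamma_j}\tfrac12(\epsilon\beta_j)^2\,\mathbb{P}\|\mathbb{Q}_\gamma \wedge \mathbb{Q}_{\psi_j(\gamma)}\|, \tag{21}$$

the last lower bound coming from the fact that

$$(\epsilon\beta_j - \hat b_j)^2 + (0 - \hat b_j)^2 \ge \tfrac12(\epsilon\beta_j)^2 \quad\text{for all } \hat b_j.$$

We assert that, if $\epsilon$ is chosen appropriately,

$$\min_{j,\gamma}\mathbb{P}\|\mathbb{Q}_\gamma \wedge \mathbb{Q}_{\psi_j(\gamma)}\| \text{ stays bounded away from zero as } n \to \infty, \tag{22}$$

which will ensure that the lower bound in (21) is eventually larger than a
constant multiple of $\sum_{j \in J}\beta_j^2 \ge c\rho_n$ for some constant
$c > 0$. The inequality in Theorem 2 will then follow.

To prove (22), consider a $\gamma$ in $\Gamma$ and the corresponding
$\gamma' = \psi_j(\gamma)$. By virtue of the inequality

$$\|\mathbb{Q}_\gamma \wedge \mathbb{Q}_{\gamma'}\| = 1 - \|\mathbb{Q}_\gamma - \mathbb{Q}_{\gamma'}\|_{TV} \ge 1 - \Bigl(2 \wedge \sum_{i \le n}h^2\bigl(Q_{\lambda_i(\gamma)}, Q_{\lambda_i(\gamma')}\bigr)\Bigr)^{1/2}$$

it is enough to show that

$$\limsup_{n \to \infty}\max_{j,\gamma}\mathbb{P}\Bigl(2 \wedge \sum_{i \le n}h^2\bigl(Q_{\lambda_i(\gamma)}, Q_{\lambda_i(\gamma')}\bigr)\Bigr) < 1. \tag{23}$$

Define $\mathcal{X}_n = \{\max_{i \le n}\|Z_i\|^2 \le C_0\log n\}$. Based on
Lemma 4, we know that $\mathbb{P}\mathcal{X}_n^c = o(1)$ with the constant
$C_0$ large enough. On $\mathcal{X}_n$ we have

$$|\lambda_i(\gamma)|^2 \le \sum_{j \in J}\beta_j^2\,\|Z_i\|^2 = O_{\mathcal{F}}(m^{1-2\beta}\log n) = o_{\mathcal{F}}(1)$$

and, by inequality (3),

$$h^2\bigl(Q_{\lambda_i(\gamma)}, Q_{\lambda_i(\gamma')}\bigr) \le O_{\mathcal{F}}(1)\,|\lambda_i(\gamma) - \lambda_i(\gamma')|^2 \le \epsilon^2O_{\mathcal{F}}(1)\,\beta_j^2z_{i,j}^2.$$

We deduce that

$$\mathbb{P}\Bigl(2 \wedge \sum_{i \le n}h^2\bigl(Q_{\lambda_i(\gamma)}, Q_{\lambda_i(\gamma')}\bigr)\Bigr) \le 2\mathbb{P}\mathcal{X}_n^c + \sum_{i \le n}\epsilon^2O_{\mathcal{F}}(1)\,\beta_j^2\,\mathbb{P}z_{i,j}^2 \le o(1) + \epsilon^2O_{\mathcal{F}}(1)\,n\beta_j^2\theta_j.$$

The choice of $J$ makes
$\beta_j^2\theta_j \le R^3m^{-\alpha-2\beta} \sim R^3/n$. Assertion (23)
follows.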
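The lower-bound construction relies on the hypercube perturbations carrying total squared size of order m^{1-2beta}, which matches the rate when m is of order n^{1/(alpha+2beta)}. The following small sketch, with a hypothetical function name and the constant in the order relation for m taken to be 1 as an assumption, evaluates the sum of (R j^{-beta})^2 over j in {m+1, ..., 2m} and the elementary two-sided bounds R^2 m (2m)^{-2beta} and R^2 m^{1-2beta} that sandwich it.

```python
def hypercube_energy(n, alpha, beta, R=1.0):
    """Sum of beta_j^2 = (R j^{-beta})^2 over J = {m+1, ..., 2m}.

    Assumption: m is taken as round(n^{1/(alpha+2beta)}), i.e. the constant
    in the order relation for m is set to 1.
    Returns (m, total, lower, upper) with lower <= total <= upper, where both
    bounds are constant multiples of m^{1-2beta}.
    """
    m = max(1, round(n ** (1.0 / (alpha + 2 * beta))))
    total = sum((R * j ** (-beta)) ** 2 for j in range(m + 1, 2 * m + 1))
    lower = R ** 2 * m * (2 * m) ** (-2 * beta)   # every j in J is at most 2m
    upper = R ** 2 * m ** (1 - 2 * beta)          # every j in J exceeds m
    return m, total, lower, upper


if __name__ == "__main__":
    print(hypercube_energy(10**6, alpha=2.0, beta=3.0))
```

Since both bounds scale like m^{1-2beta}, the sum over J is a constant multiple of the rate, which is the step used after (22).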
5. Proof of technical lemmas. 

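5.1. Proof of Lemma 1. We first need to establish the following lemma. Define

(i) $w_i := J_n^{-1/2}\xi_i$, an element of $\mathbb{R}^{N_+}$
(ii) $W_n := \sum_{i \le n}w_i\bigl(y_i - \dot\psi(\lambda_i)\bigr)$, an element of $\mathbb{R}^{N_+}$

Notice that $\mathbb{Q}W_n = 0$ and
$\mathrm{var}_{\mathbb{Q}}(W_n) = \sum_{i \le n}w_iw_i'\,\ddot\psi(\lambda_i) = I_{N_+}$
and $\mathbb{Q}|W_n|^2 = \mathrm{trace}\bigl(\mathrm{var}_{\mathbb{Q}}(W_n)\bigr) = N_+$.

Lemma 6. Suppose $0 < \epsilon_1 \le 1/2$ and $0 < \epsilon_2 < 1$ and

$$\max_{i \le n}|w_i| \le \frac{\epsilon_1\epsilon_2}{2G(1)N_+} \quad\text{with } G \text{ as in Assumption } (\dddot\psi).$$

Then $\hat g = \gamma + J_n^{-1/2}(W_n + r_n)$ with $|r_n| \le \epsilon_1$ on
the set $\{|W_n| \le \sqrt{N_+/\epsilon_2}\}$, which has
$\mathbb{Q}$-probability greater than $1 - \epsilon_2$.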

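Proof. The equality $\mathbb{Q}|W_n|^2 = N_+$ and Tchebychev give

$$\mathbb{Q}\{|W_n| > \sqrt{N_+/\epsilon_2}\} \le \epsilon_2.$$

Reparametrize by defining $t = J_n^{1/2}(g - \gamma)$. The concave function

$$\mathcal{L}_n(t) := L_n(\gamma + J_n^{-1/2}t) - L_n(\gamma) = \sum_{i \le n}\bigl[y_iw_i't + \psi(\lambda_i) - \psi(\lambda_i + w_i't)\bigr]$$

is maximized at $\hat t_n = J_n^{1/2}(\hat g - \gamma)$. It has derivative

$$\dot{\mathcal{L}}_n(t) = \sum_{i \le n}w_i\bigl(y_i - \dot\psi(\lambda_i + w_i't)\bigr).$$

For a fixed unit vector $u \in \mathbb{R}^{N_+}$ and a fixed
$t \in \mathbb{R}^{N_+}$, consider the real-valued function of the real
variable $s$,

$$H(s) := u'\dot{\mathcal{L}}_n(st) = \sum_{i \le n}u'w_i\bigl(y_i - \dot\psi(\lambda_i + sw_i't)\bigr),$$

which has derivatives

$$\dot H(s) = -\sum_{i \le n}(u'w_i)(w_i't)\,\ddot\psi(\lambda_i + sw_i't),$$
$$\ddot H(s) = -\sum_{i \le n}(u'w_i)(w_i't)^2\,\dddot\psi(\lambda_i + sw_i't).$$

Notice that $H(0) = u'W_n$ and
$\dot H(0) = -u'\sum_{i \le n}w_iw_i'\,\ddot\psi(\lambda_i)\,t = -u't$.
Write $M_n$ for $\max_{i \le n}|w_i|$. By virtue of Assumption ($\dddot\psi$),

$$|\ddot H(s)| \le \sum_{i \le n}|u'w_i|\,(w_i't)^2\,\ddot\psi(\lambda_i)\,G(|sw_i't|) \le M_nG(M_n|st|)\,t'\sum_{i \le n}w_iw_i'\,\ddot\psi(\lambda_i)\,t = M_nG(M_n|st|)\,|t|^2.$$

By Taylor expansion, for some $0 < s^* < 1$,

$$|H(1) - H(0) - \dot H(0)| \le \tfrac12|\ddot H(s^*)| \le \tfrac12M_nG(M_n|t|)\,|t|^2.$$

That is,

$$\bigl|u'\bigl(\dot{\mathcal{L}}_n(t) - W_n + t\bigr)\bigr| \le \tfrac12M_nG(M_n|t|)\,|t|^2. \tag{24}$$

Approximation (24) will control the behavior of
$L(s) := \mathcal{L}_n(W_n + su)$, a concave function of the real argument
$s$, for each unit vector $u$. By concavity, the derivative

$$\dot L(s) = u'\dot{\mathcal{L}}_n(W_n + su) = -s + R(s)$$

is a decreasing function of $s$ with

$$|R(s)| \le \tfrac12M_nG(M_n|W_n + su|)\,|W_n + su|^2.$$

On the set $\{|W_n| \le \sqrt{N_+/\epsilon_2}\}$ we have

$$|W_n \pm \epsilon_1u| \le \sqrt{N_+/\epsilon_2} + \epsilon_1.$$

Thus

$$M_n|W_n \pm \epsilon_1u| \le \frac{\epsilon_1\epsilon_2}{2G(1)N_+}\bigl(\sqrt{N_+/\epsilon_2} + \epsilon_1\bigr) < 1,$$

implying

$$|R(\pm\epsilon_1)| \le \tfrac12M_nG(1)\,|W_n \pm \epsilon_1u|^2 \le \tfrac12\epsilon_1\bigl(1 + \epsilon_1^2\epsilon_2/N_+\bigr) < \tfrac58\epsilon_1.$$

Deduce that

$$\dot L(\epsilon_1) = -\epsilon_1 + R(\epsilon_1) \le -\tfrac38\epsilon_1,$$
$$\dot L(-\epsilon_1) = \epsilon_1 + R(-\epsilon_1) \ge \tfrac38\epsilon_1.$$

The concave function $s \mapsto \mathcal{L}_n(W_n + su)$ must achieve its
maximum for some $s$ in the interval $[-\epsilon_1, \epsilon_1]$, for each
unit vector $u$. It follows that $|\hat t_n - W_n| \le \epsilon_1$. □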

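First we establish a bound on the spectral distance between $A_n^{-1}$ and
$B_n^{-1}$. Define $H = B_n^{-1}A_n - I$. Then
$\|H\|_2 \le \|B_n^{-1}\|_2\,\|A_n - B_n\|_2 \le 1/2$, which justifies the
expansion

$$\|A_n^{-1} - B_n^{-1}\|_2 = \bigl\|\bigl((I + H)^{-1} - I\bigr)B_n^{-1}\bigr\|_2 \le \sum_{j \ge 1}\|H\|_2^j\,\|B_n^{-1}\|_2 \le \|B_n^{-1}\|_2.$$

As a consequence, $\|A_n^{-1}\|_2 \le 2\|B_n^{-1}\|_2$.

Choose $\epsilon_1 = 1/2$ and $\epsilon_2 = \epsilon$ in Lemma 6. The bound
on $\max_{i \le n}|\eta_i|$ gives the bound on $\max_{i \le n}|w_i|$ needed
by the Lemma:

$$n|w_i|^2 = \eta_i'D'(J_n/n)^{-1}D\eta_i = \eta_i'A_n^{-1}\eta_i \le \|A_n^{-1}\|_2\,|\eta_i|^2.$$

Define $K_j := J_n^{-1/2}\kappa_j$, so that
$|\kappa_j'(\hat g - \gamma)|^2 \le 2(K_j'W_n)^2 + 2(K_j'r_n)^2$. By
Cauchy-Schwarz,

$$\sum_j(K_j'r_n)^2 \le \sum_j|K_j|^2\,|r_n|^2 = U_\kappa|r_n|^2$$

where

$$U_\kappa := \sum_j\kappa_j'J_n^{-1}\kappa_j = n^{-1}\sum_j(D^{-1}\kappa_j)'A_n^{-1}D^{-1}\kappa_j.$$

For the contribution $V_\kappa := \sum_j|K_j'W_n|^2$ the Cauchy-Schwarz bound
is too crude. Instead, notice that $\mathbb{Q}V_\kappa = U_\kappa$, which
ensures that the complement of the set

$$\mathcal{Y}_{\kappa,\epsilon} := \{|W_n| \le \sqrt{N_+/\epsilon}\} \cap \{V_\kappa \le U_\kappa/\epsilon\}$$

has $\mathbb{Q}$ probability less than $2\epsilon$. On the set
$\mathcal{Y}_{\kappa,\epsilon}$, the preceding bounds combine to give
$\sum_j|\kappa_j'(\hat g - \gamma)|^2 \le 2V_\kappa + 2U_\kappa|r_n|^2 \le 2U_\kappa/\epsilon + U_\kappa/2$,
and $U_\kappa \le 2n^{-1}\|B_n^{-1}\|_2\sum_j|D^{-1}\kappa_j|^2$.

The asserted bound follows.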

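5.2. Proof of Lemma 2. Throughout this subsection abbreviate
$\mathbb{P}_{n,\mu,K}$ to $\mathbb{P}$.

The matrix $A_n$ is an average of $n$ independent random matrices each of
which is distributed like
$\mathcal{N}\mathcal{N}'\,\ddot\psi(\gamma'D\mathcal{N})$, where
$\mathcal{N}' = (\mathcal{N}_0, \mathcal{N}_1, \dots, \mathcal{N}_N)$ with
$\mathcal{N}_0 \equiv 1$ and the other $\mathcal{N}_j$'s are independent
$N(0, 1)$'s. Moreover, by rotational invariance of the spherical normal, we
may assume with no loss of generality that
$\gamma'D\mathcal{N} = a + \kappa\mathcal{N}_1$, where $a = \gamma_0D_0$ and
$\kappa^2 = \sum_{k \ge 1}D_k^2\gamma_k^2$. Thus

$$B_n = \mathbb{P}\,\mathcal{N}\mathcal{N}'\,\ddot\psi(a + \kappa\mathcal{N}_1) = \mathrm{diag}(F, r_0I_{N-1})$$

where

$$r_j := \mathbb{P}\,\mathcal{N}_1^j\,\ddot\psi(a + \kappa\mathcal{N}_1) \quad\text{and}\quad F = \begin{pmatrix} r_0 & r_1 \\ r_1 & r_2 \end{pmatrix}.$$

The block diagonal form of $B_n$ simplifies calculation of spectral norms:

$$\|B_n^{-1}\|_2 = \|\mathrm{diag}(F^{-1}, r_0^{-1}I_{N-1})\|_2 \le \max\bigl(\|F^{-1}\|_2, \|r_0^{-1}I_{N-1}\|_2\bigr) \le \max\Bigl(\frac{r_0 + r_2}{r_0r_2 - r_1^2},\; r_0^{-1}\Bigr).$$

Assumption ($\ddot\psi$) ensures that both $r_0$ and $r_2$ are
$O_{\mathcal{F}}(1)$.

Continuity and strict positivity of $\ddot\psi$, together with
$\max(|a|, \kappa) = O_{\mathcal{F}}(1)$, ensure that
$c_0 := \inf_{a,\kappa}\inf_{|x| \le 1}\ddot\psi(a + \kappa x) > 0$. Thus

$$\sqrt{2\pi}\,r_0 \ge c_0\int_{-1}^{1}e^{-x^2/2}\,dx > 0.$$

Similarly

$$\sqrt{2\pi}\,(r_0r_2 - r_1^2) = \sqrt{2\pi}\,r_0\,\mathbb{P}\,\ddot\psi(a + \kappa\mathcal{N}_1)\,(\mathcal{N}_1 - r_1/r_0)^2$$
$$\ge c_0r_0\int_{-1}^{1}(x - r_1/r_0)^2e^{-x^2/2}\,dx \ge c_0r_0\int_{-1}^{1}x^2e^{-x^2/2}\,dx.$$

It follows that $\|B_n^{-1}\|_2 = O_{\mathcal{F}}(1)$.

The random matrix $A_n - B_n$ is an average of $n$ independent random
matrices each distributed like
$\mathcal{N}\mathcal{N}'\,\ddot\psi(a + \kappa\mathcal{N}_1)$ minus its
expected value. Thus, writing $\|\cdot\|_F$ for the Frobenius norm,

$$\mathbb{P}\|A_n - B_n\|_2^2 \le \mathbb{P}\|A_n - B_n\|_F^2 = n^{-1}\sum_{0 \le j,k \le N}\mathrm{var}\bigl(\mathcal{N}_j\mathcal{N}_k\,\ddot\psi(a + \kappa\mathcal{N}_1)\bigr).$$

Assumption ($\ddot\psi$) ensures that each summand is $O_{\mathcal{F}}(1)$,
which leaves us with an $O_{\mathcal{F}}(N^2/n) = o_{\mathcal{F}}(1)$ upper
bound.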

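5.3. Proof of Lemma 3. Let us temporarily write $\lambda'$ for
$\lambda + \delta$ and write $\bar\lambda$ for
$(\lambda + \lambda')/2 = \lambda + \delta/2$. Then

$$1 - \tfrac12h^2(Q_\lambda, Q_{\lambda'}) = \int\sqrt{f_\lambda(y)f_{\lambda'}(y)}\,dQ_0$$
$$= \int\exp\bigl(\bar\lambda y - \tfrac12\psi(\lambda) - \tfrac12\psi(\lambda')\bigr)\,dQ_0$$
$$= \exp\bigl(\psi(\bar\lambda) - \tfrac12\psi(\lambda) - \tfrac12\psi(\lambda')\bigr)$$
$$\ge 1 + \psi(\bar\lambda) - \tfrac12\psi(\lambda) - \tfrac12\psi(\lambda').$$

That is,

$$h^2(Q_\lambda, Q_{\lambda'}) \le \psi(\lambda) + \psi(\lambda + \delta) - 2\psi(\lambda + \delta/2).$$

By Taylor expansion in $\delta$ around 0, the right-hand side is at most

$$\tfrac14\delta^2\,\ddot\psi(\lambda) + \tfrac16\delta^3\bigl(\dddot\psi(\lambda + \delta^*) - \tfrac14\dddot\psi(\lambda + \delta^*/2)\bigr)$$

where $0 < |\delta^*| < |\delta|$. Invoke inequality ($\dddot\psi$) twice to
bound the coefficient of $\delta^3/6$ in absolute value by

$$\ddot\psi(\lambda)\bigl(G(|\delta|) + \tfrac14G(|\delta|/2)\bigr) \le \tfrac54\ddot\psi(\lambda)\,G(|\delta|).$$

The stated bound simplifies some unimportant constants.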

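5.4. Proof of Lemma 4. Without loss of generality, let us suppose $T = 1$.
For $s = 1/4$, note that

$$\mathbb{P}\exp(sW_i) = \prod_{k \in \mathbb{N}}(1 - 2s\tau_{i,k})^{-1/2} \le \exp\Bigl(\sum_{k \in \mathbb{N}}2s\tau_{i,k}\Bigr) \le e^{1/2}$$

by virtue of the inequality $-\log(1 - t) \le 2t$ for $|t| \le 1/2$. With the
same $s$, it then follows that

$$\mathbb{P}\{\max_{i \le n}W_i > 4(\log n + x)\} \le \exp\bigl(-4s(\log n + x)\bigr)\,\mathbb{P}\exp\bigl(\max_{i \le n}sW_i\bigr)$$
$$\le \frac{e^{-x}}{n}\sum_{i \le n}\mathbb{P}\exp(sW_i) \le e^{1/2}e^{-x}.$$

The 2 is just a clean upper bound for $e^{1/2}$.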

5.5. Proof of Lemma 5. We shall first show some preliminary results that will 
be used in the main proof throughout Sections 5.5.1 to 5.5.5. In this section, for 
notational simplicity, we write ^* for X^j^^- 

Many of the inequalities in this section involve sums of functions of the $\theta_j$'s.
The following result will save us a lot of repetition. To simplify the notation, we
drop the subscripts from $\mathbb P_{n,\mu,K}$.

Lemma 7. 

( i) For each r > 1 there is a constant Cr = Cr for which 

' ^i&i"^ ^ ^ \ej - - \ Ci (1 + A;i+-^ log k) if r = 1 
{ ii) For each p, 
Proof. For (i), argue in the same way as Hall and Horowitz (2007, page 85),
using the lower bounds

$$|\theta_j - \theta_k| \ge \begin{cases} c_\alpha j^{-\alpha} & \text{if } j < k/2 \\ c_\alpha|j - k|k^{-\alpha-1} & \text{if } k/2 \le j \le 2k \\ c_\alpha k^{-\alpha} & \text{if } j > 2k \end{cases}$$

where $c_\alpha$ is a positive constant.

For (ii), split the range of summation into two subsets: $\{(k,j) : j > \max(p, 2k)\}$
and $\{(k,j) : p/2 < k \le p < j \le 2k\}$. The first subset contributes at most

because a — 2/3 < —3. The second subset contributes at most 

which is of order oj-(p~"). □ 
Now remember that 

so that 



Ok{j = k} = JJ K{s,t)cl)j{s)Mt)dsdt 

= in- ^-^Jhj - z-j){zi,k - z.k), 
which implies (n — 1)^^ X^j<n '^1% = and 

(25) (n-i)-iV.^ ^^?}[ = D-^&D-^ ■.= dmg{i,ei/ei,...,eN/eN). 

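We will analyze $\tilde K$ by rewriting it using the eigenfunctions for $K$. Remember
that $z_{i,j} = \langle Z_i, \phi_j\rangle$ and the standardized variables $\eta_{i,j} = z_{i,j}/\sqrt{\theta_j}$ are independent
$N(0,1)$'s. Define $z_{\cdot j} = \langle\bar Z, \phi_j\rangle$ and $\eta_{\cdot j} = n^{-1}\sum_{i\le n}\eta_{i,j}$ and

$$C_{j,k} := (n-1)^{-1}\sum_{i\le n}(\eta_{i,j} - \eta_{\cdot j})(\eta_{i,k} - \eta_{\cdot k}),$$

the $(j,k)$-element of a sample covariance matrix of i.i.d. $N(0, I_n)$ random vectors.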
Then 
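and

(26) $$\tilde K(s,t) = \sum_{j,k\in\mathbb N}\tilde K_{j,k}\phi_j(s)\phi_k(t) \quad\text{with } \tilde K_{j,k} = \sqrt{\theta_j\theta_k}\,C_{j,k}.$$

Moreover, as shown in Lemma 14 in the supplemental Appendix, the main contribution
to $f_k = \sigma_k\tilde\phi_k - \phi_k$ is

$$\Lambda_k := \sum\nolimits^*_j\Lambda_{k,j}\phi_j \quad\text{with } \Lambda_{k,j} := \begin{cases}\sqrt{\theta_j\theta_k}\,C_{j,k}/(\theta_k - \theta_j) & \text{if } j \ne k \\ 0 & \text{if } j = k.\end{cases}$$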

Define

$$\epsilon_k := \min\{|\theta_j - \theta_k| : j \ne k\}.$$

The following two lemmas, which relate to perturbation theory for self-adjoint compact
operators (cf. e.g. Birman and Solomjak, 1987; Bosq, 2000; Kato, 1995), are
crucial in the development of Lemma 5. They are special cases of Lemma 13
and Lemma 15 in the Appendix under the general perturbation-theoretic framework.
For Lemma 8, similar results were established by other authors; see, e.g.,
Hall and Hosseini-Nasab (2006, equation 2.8) and Cai and Hall (2006, Section 5.6).
Lemma 9 extends the perturbation result for eigenprojections, obtained by Tyler
(1981, Lemma 4.1), from the matrix case to the general operator case.

Lemma 8. If $\epsilon_k > 5\|\Delta\|$, it follows that

$$\|f_k\| \le 3\|\Lambda_k\|.$$

Define $H_J = \mathrm{span}\{\phi_j : j \in J\}$ and $\tilde H_J = \mathrm{span}\{\tilde\phi_j : j \in J\}$ for $J \subset \mathbb N$.
Lemma 9. If $\min_{k\in J}\epsilon_k > 5\|\Delta\|$, then

{Hj - Hj)M = Y,^^ j Y.keJ^ ^M^j,k + Afc,i) + e 
where | |e| p is bounded by a universal constant times Ri + 1 1 A| pi?2 with 

- E... iiA^-p (e; (E... iiA^iiiM e; 

+ y ||Afcf|6,|2A;2+2- 
In fact, most of the inequalities that we need for analyzing the estimator $\hat{\mathbb B}$ defined
in (6) - (9) come from simple moment bounds (Lemma 10) for the sample
covariances $C_{j,k}$ and the derived bounds (Lemma 11) for the $\Lambda_k$'s.

The distribution of $C_{j,k}$ does not depend on the parameters of our model. Indeed,
by the usual rotation of axes we can rewrite $(n-1)C_{j,k}$ as $U_j'U_k$, where $U_1, U_2, \dots$
are independent $N(0, I_{n-1})$ random vectors. This representation gives some useful
equalities and bounds.

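Lemma 10. Uniformly over distinct $j, k, \ell$,

(i) $\mathbb P C_{j,j} = 1$ and $\mathbb P(C_{j,j} - 1)^2 = 2(n-1)^{-1}$

(ii) $\mathbb P C_{j,k} = \mathbb P C_{j,k}C_{j,\ell} = 0$

(iii) $\mathbb P C_{j,k}^2 = O(n^{-1})$

Proof. Assertion (i) is classical because $|U_j|^2 \sim \chi^2_{n-1}$. For assertion (ii) use
$\mathbb P(U_1'U_2 \mid U_2) = 0$ and

$$\mathbb P(U_1'U_2U_2'U_3 \mid U_2) = \mathrm{trace}\left(U_2U_2'\,\mathbb P(U_3U_1')\right) = 0.$$

For (iii) use $\mathbb P(U_1U_1') = I_{n-1}$ and

$$\mathbb P(U_1'U_2U_2'U_1 \mid U_2) = \mathrm{trace}\left(U_2U_2'\,\mathbb P(U_1U_1')\right) = \mathrm{trace}(U_2U_2') = |U_2|^2.$$

□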
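
The moment identities for sample covariances of independent standard normal coordinates can be checked by simulation. The sketch below is only an illustration under stated assumptions: the helper name `simulate_sample_cov`, the choice of three coordinates, and the values of `n`, `reps` and `seed` are all hypothetical. It estimates the mean of the diagonal entry, its mean squared deviation from 1 (theoretical value $2/(n-1)$), the mean of an off-diagonal entry and of a product of two distinct off-diagonal entries sharing an index (both 0), and the second moment of an off-diagonal entry (theoretical value $1/(n-1)$, hence $O(n^{-1})$).

```python
import numpy as np


def simulate_sample_cov(n, reps, seed=0):
    """Return an array of shape (reps, 3, 3): for each replicate, the sample
    covariance matrix (divisor n - 1) of n observations of three independent
    N(0, 1) coordinates."""
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal((reps, n, 3))
    centered = eta - eta.mean(axis=1, keepdims=True)
    # Sum over the n observations of centered[:, i, j] * centered[:, i, k]
    return np.einsum("rij,rik->rjk", centered, centered) / (n - 1)


if __name__ == "__main__":
    n = 10
    C = simulate_sample_cov(n=n, reps=100_000)
    print("mean of C_jj           :", C[:, 0, 0].mean())
    print("mean of (C_jj - 1)^2   :", ((C[:, 0, 0] - 1) ** 2).mean(), "vs", 2 / (n - 1))
    print("mean of C_jk           :", C[:, 0, 1].mean())
    print("mean of C_jk * C_jl    :", (C[:, 0, 1] * C[:, 0, 2]).mean())
    print("mean of C_jk^2         :", (C[:, 0, 1] ** 2).mean(), "vs", 1 / (n - 1))
```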



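Lemma 11. Uniformly over distinct $j, k, \ell$,

(i) $\mathbb P\Lambda_{k,j} = \mathbb P\Lambda_{k,j}\Lambda_{k,\ell} = 0$

(ii) $\mathbb P\Lambda_{k,j}^2 = O_{\mathcal F}\left(n^{-1}k^{-\alpha}j^{-\alpha}(\theta_k - \theta_j)^{-2}\right)$

(iii) $\mathbb P\|\Lambda_k\|^2 = O_{\mathcal F}(n^{-1}k^2)$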

Proof. Assertions (i) and (ii) follow from Assertions (ii) and (iii) of Lemma 10. 
For (iii), note that 

F\\Akf = Y./^lk = Or{n~^k~^)Kk{2,a) 

□ 

To prove Lemma 5 we define „ as an intersection of sets chosen to make the 
six assertions of the Lemma hold, 

Xe,n XAjH ^ Xz,n ^ XA,n ^ Xj^^^ H "X^Ajni 
where the complement of each of the five sets appearing on the right-hand side
has probability less than $\epsilon/5$. More specifically, for a large enough constant $C_\epsilon$, we
define

XA,n = {||A|| < 

3^z,n = {maxj<„ < Celognand ||Z|| < CeW^^'^} 

'^Ti,n = {niaxj<„ |r/j|^ < CeNlogn} as in Section 3.3 

The definition of $\mathcal X_{\Lambda,n}$, in subsection 5.5.3, is slightly more complicated. It is
defined by requiring various functions of the $\Lambda_k$'s to be smaller than $C_\epsilon$ times their
expected values.

The set XA,n is almost redundant. From Definition 1 we know that 

mill Wi - OiA > {alR)N-^-'^ and min 9^ > R-^N''^. 

^<j<j'<N l<j<N ■' 

The choice N -nf- with C < (2 + 2a)~i ensures that n^/^jy-i-" oo. On Xa,„ 
the spacing assumption used in Lemmas 8 and 9 holds for all n large enough; all 
the bounds from those lemmas are available to us on X^^n. In particular, 

max,-<;v \0j/9, - 1| < Oj-(iV°|| A||) = 

Equality (25) shows that X^.n ^ Xa,™ eventually if we make sure Ce > 1. 

5.5.1. Proof of Lemma 5 part (i). Observe that 

^ii^ii' = Y.,/{^i'^ - ^^^^ = ^}f = E,-fc^j-^'^^('5^> - i-?' = 

5.5.2. Proof of Lemma 5 part (ii). As before. Lemma 4 controls maxi<n, 

To control the $\bar Z$ contribution, note that $n\|\bar Z\|^2$ has the same distribution as $\|Z_1\|^2$,
which has expected value $\sum_{j\in\mathbb N}\theta_j < \infty$.

5.5.3. Proof of Lemma 5 parts (iii) and (iv). Calculate expected values for all
the terms that appear in the bound of Lemma 9. 

^--^E,<, (E,>/^.^.y +ip^-M.A'E,>, (E.</^.^^)' 

= Ez,^ E- ^n,^,,KAlJ [b] + bl) by Lemma 11 part (i) 
(27) =0^(n"V"") by Lemma 7 
and

and 

and 
and 

(28) =Oj-(n^^) by Lemma 7 
and 

iiaiix..e.,,iia^-ii^(e;^)' 

(29) = 0^(n-"||A||2) (p=' +p='+2°-2''log2p) 

and 
(30) 

For some constant Ce = Ce{T), on a set Xa^u with Pn,/i,_ft:^A n < ^' ^^^^ of the 
random quantities in the previous set of inequalities (for both p = m and p = N) 
is bounded by times its Pn.^.ii" expected value. By virtue of Lemma 1 1 part (iii), 
we may also assume that ||Afc|p < C^k'^/n on XA,n- 

From Lemma 9, it follows that on the set XA,n H XA,n, ifp < N, 

\\iHp-Hp)Mf 

< O^n-^p^-^) + O^n-^) (l + /+2"-2/3 + logp + p6-/3 + log2 
+ OT{n~^P^)OAn-^) + O^n-^) (p^ + p5+2"-2^ log2 

+ 0^(n-V)O.F(l + /+2"-2/3 log2^) 

This inequality leads to the asserted conclusions when $p = m$ or $p = N$.
5.5.4. Proof of Lemma 5 part (v). By construction, rjn = 1 for every i and, for

i > 2, 

^/OjVij = - z.j) = (Zj - Z, 

Thus, for j > 2, 

CTjrjij = 9~^^'^{Zi - Z, cf>j + fj) = rjij + \j 
with, due to Lemma 8, 

In vector form, 

(31) Sl^i = Vi + Si with \6i\^ = Or f j < or{n/N^) on X,,„. 

It follows that 

maxj<„ \r]i\ = maxj<„ \Sr]i\ < maxj<„ \r]i\+OJr{^/n/N) = Ojr{y/n/N) on X^^n- 

5.5.5. Proof of Lemma 5 part (vi). From inequality (20) we know that 

CN ■■= maxj<„ \Xi^N - K,n\ = Oj-(n"^^+'^'^/^) on X^^n 

and from the Section 3.3 we have maxj<„ {Xi^^l = Ojr{\/ log n). Assumption (^) 
in Section 2 and the Mean- Value theorem then give 

maxi<„ \'ip{\i^N) - 'ip{\,N)\ < eN'ip{K,N)G{eN) = ojr{l). 

If we replace ^/^(Ai^Ar) in the definition of An by Lj := ^/^(Aj^Ar) we make a change F 
with 



|r||2<o^(i)||(n-i)-iJ^ 



which, by equality (25), is of order oj-(l) on X^^n- 

From Assumption (^) we have Cn := logmaxj<„Lj = oj:-(logn). Uniformly 
over all unit vectors u in M^+i we therefore have 



uSAnSu = ojr{l) + {n-l) ^V].^ Liu{rji + 6i){r]i + 6iy 
= 0^(1) + (l + 0(n-^)) u'AnU 

Rearrange then take a supremum over u to conclude that 

ll^InS - Anh < 0^(1) + 0^(e^") maxi<„ (\6i\^ + 2|5i| \r]i\ 



Representation (31) and the defining property of X-q^n then ensure that the upper 
bound is of order oj-(l) on X^^n- 
6. Appendix. In this section, we introduce some useful results in spectral theory
and perturbation theory. Some of the results are well established; we briefly
review them for ease of reference. For example, the results for eigenvalues
have been standard for decades (see, e.g., Dunford and Schwartz,
1988, Chapter VII.6). We derive a bound for the perturbation of eigenprojections
(Lemma 15), which plays a key role in the slope function estimation problem. This
bound is closely related to Proposition 2 in Cardot, Mas and Sarda (2007), which was
tailored to solve the prediction problem at a random design. However, the two results
are different. Their result and our bound in Lemma 15 are compared in the discussion
following Lemma 15. We could not find the same (or a stronger)
bound stated explicitly in the existing perturbation literature.

Spectral theory and perturbation theory in Hilbert spaces are
powerful tools that allow statisticians to tackle statistical approximation
problems in an elegant way. In Lemmas 12 to 14 we review the
well-established perturbation-theoretic results for eigenvalues and eigenvectors of
positive and self-adjoint compact operators. Our main contribution in
this section is to extend the perturbation result for eigenprojections, obtained by
Tyler (1981, Lemma 4.1), from the matrix case to the general operator case. Our
perturbation result for eigenprojections is stated in Lemma 15.

Suppose $T$ is a positive and self-adjoint compact operator in a Hilbert space $\mathcal H$.
According to the spectral theory for positive and self-adjoint compact operators
(see e.g. Birman and Solomjak, 1987, page 209), the operator $T$ has a sequence of
decreasing nonnegative eigenvalues $\{\theta_j\}$ and a sequence of corresponding eigenvectors
$\{e_j\}$. That is, $Te_j = \theta_je_j$ with $\theta_1 \ge \theta_2 \ge \dots \ge 0$. Furthermore, $T$ has the
spectral decomposition

(32) $$T = \sum_{j\in\mathbb N}\theta_j\,e_j\otimes e_j$$

which converges in the operator norm.

In this section, the perturbation-theoretic results are purely functional-analytic
and involve no randomness. The focus here is on results for positive
and self-adjoint compact operators; for a more general discussion of perturbation
theory of linear operators see, for example, Kato (1995). More precisely,
let $\tilde T$ be another positive and self-adjoint compact operator in $\mathcal H$ with spectral
decomposition

(33) $$\tilde T = \sum_{j\in\mathbb N}\tilde\theta_j\,\tilde e_j\otimes\tilde e_j.$$

The eigenprojection of the operator $T$ associated with eigenvalues $\Theta_J := \{\theta_j : j \in J\}$,
denoted by $H_J$, is the orthogonal projection onto the eigenspace of $T$
associated with $\Theta_J$, that is, $\mathrm{span}\{e_j : j \in J\}$. In fact, we have $H_J = \sum_{j\in J}e_j\otimes e_j$.
Analogously, the eigenprojection $\tilde H_J$ can be defined for the operator $\tilde T$. We
shall study how well the differences $\tilde\theta_j - \theta_j$, $\tilde e_j - e_j$, and $\tilde H_J - H_J$ can be
controlled by $\Delta = \tilde T - T$, given that $\delta := \|\Delta\|$ is small.

In statistical applications, the operator $T$ is usually unknown, while the
operator $\tilde T$ is an estimate of $T$. Perturbation theory suggests that as long as
$\tilde T$ approximates $T$ well, the eigen-elements of $\tilde T$ approximate the analogous
eigen-elements of $T$ well. This idea has been explored and utilized by Tyler (1981), Bosq
(2000), Cai and Hall (2006), Hall and Horowitz (2007), and Cardot, Mas and Sarda
(2007), among others. More interestingly, Hall and Hosseini-Nasab (2006) propose
a Taylor-expansion type of approximation of eigenvectors, which is better
adapted to statistical approximation purposes.

In the application of Section 3.4, we draw probabilistic conclusions when $\tilde T$ is
random, for the special case where $T = K$, the population covariance kernel, and
$\tilde T = \tilde K$, the sample covariance kernel, both acting on $\mathcal H = L^2[0,1]$. The eigenvectors
$\{e_j\}$ and $\{\tilde e_j\}$ will be the principal components $\{\phi_j\}$ and $\{\tilde\phi_j\}$ respectively.

Before stating the perturbation-theoretic results in detail, we
introduce some necessary notation and basic mathematical relations. Because
$\{e_j : j \in \mathbb N\}$ forms a complete orthonormal basis for the Hilbert space $\mathcal H$, the
operator $\tilde T$ also has the following representation

(34) $$\tilde T = \sum_{j,k\in\mathbb N}\tilde T_{j,k}\,e_j\otimes e_k$$

which converges in the operator norm.

Note that $\tilde T_{j,k} = \tilde T_{k,j}$ because $\tilde T$ is self-adjoint. This representation gives

$$\Delta = \sum_{j,k\in\mathbb N}\left(\tilde T_{j,k} - \theta_j\{j = k\}\right)e_j\otimes e_k$$

and

$$\delta^2 = \|\Delta\|^2 = \sup_{\|x\|=1}\langle x, \Delta x\rangle^2 \le \sum_{j,k\in\mathbb N}\left(\tilde T_{j,k} - \theta_j\{j = k\}\right)^2.$$
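We also define

(35) $$\Lambda_k := \sum\nolimits^*_j\Lambda_{k,j}e_j \quad\text{with } \Lambda_{k,j} := \begin{cases}\tilde T_{j,k}/(\theta_k - \theta_j) & \text{if } j \ne k \\ 0 & \text{if } j = k.\end{cases}$$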

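Notice that $\{\tilde e_k : k \in \mathbb N\}$ is also an orthonormal basis for $\mathcal H$. Define $\sigma_{j,k} := \langle e_j, \tilde e_k\rangle$.
Then

$$e_j = \sum_{k\in\mathbb N}\sigma_{j,k}\tilde e_k \quad\text{and}\quad \tilde e_k = \sum_{j\in\mathbb N}\sigma_{j,k}e_j$$

and

$$\{j = j'\} = \langle e_j, e_{j'}\rangle = \sum_{k\in\mathbb N}\sigma_{j,k}\sigma_{j',k}.$$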
We cannot hope to find a useful bound on $\|\tilde e_k - e_k\|$, because there is no way to
decide which of $\pm\tilde e_k$ should be approximating $e_k$. However, we can bound $\|f_k\|$,
where

(36) $$f_k = \sigma_k\tilde e_k - e_k \quad\text{with } \sigma_k := \mathrm{sign}(\sigma_{k,k}) := \begin{cases}+1 & \text{if } \sigma_{k,k} \ge 0 \\ -1 & \text{otherwise}\end{cases}$$

which will be enough for our purposes.

To simplify notation, write $\sum^*_j$ for $\sum_{j\in\mathbb N}\{j \ne k\}$ and $\sum^*_k$ for $\sum_{k\in\mathbb N}\{k \ne j\}$
in this section.

The following lemma has been proved in multiple places. We state it here with
a brief proof for ease of reference.

Lemma 12. Suppose $T$ and $\tilde T$ are two positive and self-adjoint operators with
spectral decompositions (32) and (33). Then

(37) $$|\tilde\theta_j - \theta_j| \le \delta \quad\text{for all } j \in \mathbb N.$$

Proof. The eigenvalues have a variational characterization; see Bosq (2000,
Section 4.2) or Birman and Solomjak (1987, Chapter 9):

(38) $$\theta_j = \inf_{\dim(L) < j}\,\sup\{\langle x, Tx\rangle : x \perp L \text{ and } \|x\| = 1\}.$$

The first infimum runs over all subspaces $L$ with dimension at most $j - 1$. (When $j$
equals 1 the only such subspace is $\{0\}$.) Both the infimum and the supremum are
achieved: by $L_{j-1} = \mathrm{span}\{e_i : 1 \le i < j\}$ and $x = e_j$. Similar assertions hold
for $\tilde T$ and its eigenvalues.
By the analog of (38) for $\tilde T$,

$$\tilde\theta_j \le \sup\{\langle x, \tilde Tx\rangle : x \perp L_{j-1} \text{ and } \|x\| = 1\}$$

$$\le \sup\{\langle x, Tx\rangle + \delta : x \perp L_{j-1} \text{ and } \|x\| = 1\} = \theta_j + \delta.$$

Argue similarly with the roles of $T$ and $\tilde T$ reversed to conclude the result. □
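
The eigenvalue bound (ordered eigenvalues of two self-adjoint operators differ by at most the operator norm of their difference) is easy to check numerically in finite dimensions. The following is a minimal sketch with symmetric positive semidefinite matrices standing in for the operators; the helper names `random_psd` and `eigenvalue_shifts`, the dimension and the perturbation size are assumptions made for illustration.

```python
import numpy as np


def random_psd(d, rng):
    """A random symmetric positive semidefinite d x d matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T


def eigenvalue_shifts(T, T_tilde):
    """Return (|theta_tilde_j - theta_j| for the decreasingly ordered
    eigenvalues, delta) where delta is the spectral norm of T_tilde - T."""
    theta = np.linalg.eigvalsh(T)[::-1]              # eigvalsh sorts ascending
    theta_tilde = np.linalg.eigvalsh(T_tilde)[::-1]
    delta = np.linalg.norm(T_tilde - T, ord=2)
    return np.abs(theta_tilde - theta), delta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    T = random_psd(8, rng)
    T_tilde = T + 0.05 * random_psd(8, rng)          # still positive semidefinite
    shifts, delta = eigenvalue_shifts(T, T_tilde)
    print("largest eigenvalue shift:", shifts.max())
    print("spectral norm of the perturbation:", delta)
```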

In order to approximate an eigenvector reasonably well, we need to assume
that the eigenvalue $\theta_k$ is well separated from the other $\theta_j$'s, to avoid the problem
that the eigenspace of $T$ for the eigenvalue $\theta_k$ might have dimension greater than
one. More precisely, we consider a $k$ for which

(39) $$\epsilon_k := \min\{|\theta_j - \theta_k| : j \ne k\} > 5\delta,$$

which, together with Lemma 12, implies

$$|\tilde\theta_k - \theta_j| \ge |\theta_k - \theta_j| - \delta \ge \tfrac45|\theta_k - \theta_j| \ge \tfrac45\epsilon_k.$$
The following lemmas provide approximation results for $f_k$ under the assumption
that $\epsilon_k > 5\delta$. Similar results were established by other authors; see, for example,
Hall and Hosseini-Nasab (2006, equation 2.8) and Cai and Hall (2006, Section
5.6).

Lemma 13. Suppose $T$ and $\tilde T$ are two positive and self-adjoint operators with
spectral decompositions (32) and (33). The vectors $\{\Lambda_k\}$ and $\{f_k\}$ are defined as
in (35) and (36) respectively. Then if $\epsilon_k > 5\delta$, it follows that

$$\|f_k\| \le 3\|\Lambda_k\|.$$

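Proof. The starting point for our approximations is the equality

(40) $$\langle\Delta\tilde e_k, e_j\rangle = \langle\tilde T\tilde e_k, e_j\rangle - \langle\tilde e_k, Te_j\rangle = (\tilde\theta_k - \theta_j)\sigma_{j,k}.$$

For $j \ne k$ we then have

$$\sigma_{j,k}^2 \le \frac{25}{16}\,\frac{\langle\Delta\tilde e_k, e_j\rangle^2}{(\theta_k - \theta_j)^2}$$

which implies

$$\sigma_{j,k}^2 \le \frac{25}{8}\langle\Delta f_k, e_j\rangle^2/\epsilon_k^2 + \frac{25}{8}\tilde T_{j,k}^2/(\theta_k - \theta_j)^2 \quad\text{because } \langle Te_k, e_j\rangle = 0 \text{ for } j \ne k.$$

The introduction of the $\sigma_k$ also ensures that

$$\|f_k\|^2 = \|\tilde e_k\|^2 + \|e_k\|^2 - 2\sigma_k\langle\tilde e_k, e_k\rangle = 2 - 2|\sigma_{k,k}|$$

$$\le 2 - 2\sigma_{k,k}^2 \quad\text{because } |\sigma_{k,k}| \le 1$$

$$= 2\sum\nolimits^*_j\sigma_{j,k}^2 \le \frac{25}{4}\sum\nolimits^*_j\langle\Delta f_k, e_j\rangle^2/\epsilon_k^2 + \frac{25}{4}\sum\nolimits^*_j\Lambda_{k,j}^2.$$

The first sum on the right-hand side is less than

$$\frac{25}{4}\|\Delta f_k\|^2/\epsilon_k^2 \le \|\Delta f_k\|^2/(4\delta^2) \le \delta^2\|f_k\|^2/(4\delta^2) = \|f_k\|^2/4.$$

The second sum can be written as $25\|\Lambda_k\|^2/4$. Then,

$$\|f_k\|^2 \le \tfrac{25}{3}\|\Lambda_k\|^2 \le 9\|\Lambda_k\|^2.$$

□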
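
A finite-dimensional illustration may help fix the objects in this bound. In the sketch below, which is an assumption-laden example rather than part of the argument, $T$ is a diagonal matrix with decreasing eigenvalues `theta` (so its eigenvectors are the standard basis vectors), $\tilde T = T + \Delta$ for a small symmetric `Delta`, the vector $\Lambda_k$ has coordinates $\tilde T_{j,k}/(\theta_k - \theta_j)$ for $j \ne k$, and $f_k$ is the sign-corrected difference between the $k$-th eigenvector of $\tilde T$ and $e_k$. The function names and the particular eigenvalues are hypothetical.

```python
import numpy as np


def random_symmetric(d, size, rng):
    """Random symmetric d x d matrix rescaled to have spectral norm `size`."""
    G = rng.standard_normal((d, d))
    S = (G + G.T) / 2.0
    return size * S / np.linalg.norm(S, 2)


def eigenvector_perturbation(theta, Delta, k):
    """For T = diag(theta) with theta strictly decreasing and T_tilde = T + Delta,
    return (||f_k||, ||Lambda_k||, eps_k, delta)."""
    theta = np.asarray(theta, dtype=float)
    d = len(theta)
    T_tilde = np.diag(theta) + Delta
    vals, vecs = np.linalg.eigh(T_tilde)
    order = np.argsort(vals)[::-1]                 # decreasing eigenvalue order
    e_tilde_k = vecs[:, order[k]]
    sigma_k = 1.0 if e_tilde_k[k] >= 0 else -1.0   # sign of <e_k, e_tilde_k>
    f_k = sigma_k * e_tilde_k
    f_k[k] -= 1.0                                  # subtract the basis vector e_k
    Lambda_k = np.zeros(d)
    for j in range(d):
        if j != k:
            Lambda_k[j] = T_tilde[j, k] / (theta[k] - theta[j])
    eps_k = min(abs(theta[j] - theta[k]) for j in range(d) if j != k)
    delta = np.linalg.norm(Delta, 2)
    return np.linalg.norm(f_k), np.linalg.norm(Lambda_k), eps_k, delta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = [5.0, 4.0, 3.0, 2.0, 1.0]
    Delta = random_symmetric(len(theta), 0.1, rng)
    for k in range(len(theta)):
        nf, nl, eps_k, delta = eigenvector_perturbation(theta, Delta, k)
        print(k, "eps_k > 5*delta:", eps_k > 5 * delta, "||f_k||:", nf, "3||Lambda_k||:", 3 * nl)
```

With a perturbation this small one would expect $\|f_k\|$ and $\|\Lambda_k\|$ to be close, since $\Lambda_k$ is the first-order term in the expansion of $f_k$.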
Lemma 14. Suppose $T$ and $\tilde T$ are two positive and self-adjoint operators with
spectral decompositions (32) and (33). The vectors $\{\Lambda_k\}$ and $\{f_k\}$ are defined as
in (35) and (36) respectively. Then if $\epsilon_k > 5\delta$, the vector $f_k$ has
the representation:

$$f_k = \Lambda_k + r_k$$

with

$$\langle r_k, e_k\rangle = -\tfrac12\|f_k\|^2 \quad\text{and}\quad |\langle r_k, e_j\rangle| \le \frac{5\delta\|\Lambda_k\|}{|\theta_k - \theta_j|} \text{ for } j \ne k.$$



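Proof. Start once more from equality (40). For $j \ne k$,

$$\sigma_k\sigma_{j,k} = \sigma_k\langle\Delta\tilde e_k, e_j\rangle/(\tilde\theta_k - \theta_j)$$

$$= \langle\Delta(e_k + f_k), e_j\rangle/(\theta_k + \gamma_k - \theta_j) \quad\text{where } \gamma_k = \tilde\theta_k - \theta_k$$

$$= \Lambda_{k,j}\left(1 - \frac{\gamma_k}{\tilde\theta_k - \theta_j}\right) + \frac{\langle\Delta f_k, e_j\rangle}{\tilde\theta_k - \theta_j} \quad\text{because } \langle Te_k, e_j\rangle = 0$$

(41) $$= \Lambda_{k,j} + r_{k,j} \quad\text{where } r_{k,j} := \frac{\tilde\theta_k - \theta_k}{\theta_j - \tilde\theta_k}\Lambda_{k,j} + \frac{\langle\Delta f_k, e_j\rangle}{\tilde\theta_k - \theta_j}.$$

The $r_{k,j}$'s are small:

$$|r_{k,j}| \le \frac54\left(\frac{\delta|\Lambda_{k,j}| + |\langle\Delta f_k, e_j\rangle|}{|\theta_k - \theta_j|}\right) \quad\text{for } j \ne k, \text{ if } \epsilon_k > 5\delta$$

(42) $$\le \frac{5\delta\|\Lambda_k\|}{|\theta_k - \theta_j|} \quad\text{by Lemma 13.}$$

Define $r_{k,k} = |\sigma_{k,k}| - 1 = -\tfrac12\|f_k\|^2$ and $r_k = \sum_{j\in\mathbb N}r_{k,j}e_j$. Combining (35) and
(41), we then have the representation:

(43) $$f_k = \sigma_k\tilde e_k - e_k = \left(\sigma_k\langle\tilde e_k, e_k\rangle - 1\right)e_k + \sum\nolimits^*_j\sigma_k\sigma_{j,k}e_j = \Lambda_k + r_k.$$

□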

In the rest of this section, we shall establish an approximation for $\tilde H_J\mathbb B - H_J\mathbb B$
for a $\mathbb B = \sum_j b_je_j$ in $\mathcal H$, an extension of the finite-dimensional perturbation result
of Tyler (1981, Lemma 4.1) to the case of general infinite-dimensional operators.
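The difference $\tilde H_J - H_J$ equals

$$\sum_{k\in J}\left[(\sigma_k\tilde e_k)\otimes(\sigma_k\tilde e_k) - e_k\otimes e_k\right]$$

$$= \sum_{k\in J}\left[\sigma_k\tilde e_k\otimes r_k + (e_k + f_k)\otimes\Lambda_k\right] + \sum_{k\in J}\left[(e_k + \Lambda_k + r_k)\otimes e_k - e_k\otimes e_k\right]$$

$$= \mathcal R_J + \sum_{k\in J}\left(e_k\otimes\Lambda_k + \Lambda_k\otimes e_k\right)$$

(44) $$\text{where } \mathcal R_J := \sum_{k\in J}\left(\sigma_k\tilde e_k\otimes r_k + f_k\otimes\Lambda_k + r_k\otimes e_k\right).$$

Self-adjointness of $\tilde T$ implies $\tilde T_{j,k} = \tilde T_{k,j}$ and hence $\Lambda_{j,k} = -\Lambda_{k,j}$. The
anti-symmetry eliminates some terms from the main contribution to $\tilde H_J - H_J$:

(45) $$\sum_{k\in J}\left(e_k\otimes\Lambda_k + \Lambda_k\otimes e_k\right) = \sum_{j\in J}\sum_{k\in J^c}\Lambda_{j,k}\left(e_j\otimes e_k + e_k\otimes e_j\right).$$

With this simplification we get the following representation for $(\tilde H_J - H_J)\mathbb B$:

$$(\tilde H_J - H_J)\mathbb B = \sum_{j\in J}\sum_{k\in J^c}\Lambda_{j,k}\left(b_ke_j + b_je_k\right) + \mathcal R_J\mathbb B.$$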
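
The cancellation produced by anti-symmetry can be verified directly in finite dimensions. The sketch below assumes a diagonal $T$ with distinct eigenvalues `theta`, a symmetric matrix `T_tilde`, and coefficients $\Lambda_{k,j} = \tilde T_{j,k}/(\theta_k - \theta_j)$ for $j \ne k$ (zero on the diagonal); it then compares the sum over $k \in J$ of $e_k\otimes\Lambda_k + \Lambda_k\otimes e_k$ with the double sum over $j \in J$, $k \notin J$ of $\Lambda_{j,k}(e_j\otimes e_k + e_k\otimes e_j)$. Tensor products are represented by outer products; the function names, the dimension and the index set `J` are illustrative assumptions.

```python
import numpy as np


def lambda_matrix(theta, T_tilde):
    """Matrix L with L[k, j] = T_tilde[j, k] / (theta[k] - theta[j]) for j != k
    and zeros on the diagonal; row k holds the coordinates of Lambda_k."""
    d = len(theta)
    L = np.zeros((d, d))
    for k in range(d):
        for j in range(d):
            if j != k:
                L[k, j] = T_tilde[j, k] / (theta[k] - theta[j])
    return L


def both_sides(L, J):
    """Return the two sides of the identity as d x d matrices."""
    d = L.shape[0]
    basis = np.eye(d)                      # basis[k] is the basis vector e_k
    J_complement = [k for k in range(d) if k not in J]
    lhs = sum(np.outer(basis[k], L[k]) + np.outer(L[k], basis[k]) for k in J)
    rhs = sum(
        L[j, k] * (np.outer(basis[j], basis[k]) + np.outer(basis[k], basis[j]))
        for j in J
        for k in J_complement
    )
    return lhs, rhs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
    G = rng.standard_normal((6, 6))
    T_tilde = np.diag(theta) + 0.1 * (G + G.T) / 2.0
    L = lambda_matrix(theta, T_tilde)
    lhs, rhs = both_sides(L, J=[0, 1, 2])
    print("anti-symmetric:", np.allclose(L, -L.T))
    print("identity holds:", np.allclose(lhs, rhs))
```

The pairs with both indices inside $J$ cancel because each appears twice with opposite signs, which is why only the cross terms between $J$ and its complement survive.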

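For the three contributions to the bound for $\|\mathcal R_J\mathbb B\|^2$ we make repeated use of the
inequalities, based on Lemma 13 and Lemma 14,

(46) $$|\langle r_k, x\rangle| \le \frac32\|\Lambda_k\|\,\|f_k\|\,|x_k| + 5\delta\|\Lambda_k\|\sum\nolimits^*_j\frac{|x_j|}{|\theta_k - \theta_j|}$$

(47) $$\le \frac92\|\Lambda_k\|^2|x_k| + 5\delta\|\Lambda_k\|\sum\nolimits^*_j\frac{|x_j|}{|\theta_k - \theta_j|}$$

which is valid whenever $\epsilon_k > 5\delta$. Combining (46) with the following well-known
inequality (see e.g. Hall and Horowitz, 2007, Equ. 5.2):

$$\|f_k\| \le 2\sqrt2\,\delta\,\min\{\theta_{k-1} - \theta_k, \theta_k - \theta_{k+1}\}^{-1} = 2\sqrt2\,\delta\,\epsilon_k^{-1},$$

we get

$$|\langle r_k, x\rangle| \le C\delta\|\Lambda_k\|\left(\epsilon_k^{-1}|x_k| + \sum\nolimits^*_j\frac{|x_j|}{|\theta_k - \theta_j|}\right).$$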

To avoid an unnecessary calculation of precise constants, we adopt the convention 
of the variable constant: we write C for a universal constant whose value might 
change from one line to the next. The first two contributions are: 



2 



<C6^y^ IIAfci 
and



sc(E.«l|A.f)E.,,,(E•A«^)^ 



For the third contribution, let x = J2j ^j^j be an arbitrary unit vector in "K. Then 



(48) 



(49) 



take the supremum over x, which doesn't even appear in the last line, to get the 
same bound for || J2keJ ^fe^'fclP- 

In sum, we obtain the following lemma:

Lemma 15. If $\min_{k\in J}\epsilon_k > 5\delta$, then

$$(\tilde H_J - H_J)\mathbb B = \sum_{j\in J}\sum_{k\in J^c}\Lambda_{j,k}\left(b_ke_j + b_je_k\right) + \mathcal R_J\mathbb B$$

where $\mathcal R_J$ is defined in (44) and $\|\mathcal R_J\mathbb B\|^2$ is bounded by a universal constant times
$R_1 + \delta^2R_2$ with



«. = (E,«iia.iP)E.,,(E>^.a)'

- E.,. iiA^ii^ (e; (e.,. IIA.IIIM e; ^)



This lemma is the keystone for establishing parts (iii) and (iv) of Lemma 5. It is
similar to Proposition 2 in Cardot, Mas and Sarda (2007) in the sense that both deal
with approximation problems for eigenprojections. In particular, we observe
that the same trick of using anti-symmetry is applied to eliminate some terms from
the main contributions to the approximation errors (see Equation (45) in this section
and Equation (23) in Cardot, Mas and Sarda (2007)). However, the two results are
different. The other authors consider a bound for $\langle(\tilde H_J - H_J)\mathbb B, X_{n+1}\rangle$, which is
motivated by the prediction problem at a random design, whereas we establish a
bound for $\|(\tilde H_J - H_J)\mathbb B\|$, which is relevant to the slope function estimation problem.
More precisely, the independent randomness of $X_{n+1}$ helps cancel many
cross-product terms and accelerates the decay rates of the summands in the expansion.
See, for example, Equations (24) and (25) in Cardot, Mas and Sarda (2007).
Due to the 'smoothing' effect of the independent random curve $X_{n+1}$, we cannot
directly apply the convergence result in Cardot, Mas and Sarda (2007, Proposition
2) to our case. Besides, the bound in the lemma above is a purely mathematical
perturbation-theoretic result that involves no randomness, which we
believe makes it potentially more general.

Statistics Department 
Yale University 

E-MAIL: Winston.Wei.Dou@aya.yale.edu 
E-MAIL: David.Pollard@yale.edu 
E-MAIL: Huibin.Zhou@yale.edu 
URL: http://www.stat.yale.edu/ 



